Sequence Analysis

clstr_sort_prot_by.pl: Function: This script sorts sequences within clusters in a .clstr file by length, name, etc.

Usage: clstr_sort_prot_by.pl input.clstr id > input_sort.clstr
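
To illustrate the idea of sorting within clusters, here is a minimal Python sketch, not the Perl script itself. It assumes the usual .clstr layout (a ">Cluster N" header followed by tab-separated member lines such as "0	120aa, >seqA... *") and sorts members by decreasing length; the function names and the sort direction are assumptions made for this example.

```python
import re

def parse_clstr(text):
    """Parse .clstr text into a list of (header, member_lines) tuples."""
    clusters = []
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            clusters.append((line, []))
        elif line.strip() and clusters:
            clusters[-1][1].append(line)
    return clusters

def member_length(line):
    """Read the length from a member line, e.g. '0\t120aa, >seqA... *' gives 120."""
    match = re.search(r"(\d+)(?:aa|nt),", line)
    return int(match.group(1)) if match else 0

def sort_clusters_by_length(text):
    """Sort the members of each cluster by decreasing length (assumed order)."""
    out = []
    for header, members in parse_clstr(text):
        out.append(header)
        out.extend(sorted(members, key=member_length, reverse=True))
    return "\n".join(out) + "\n"

# Hypothetical input, written by hand for illustration only.
example = (
    ">Cluster 0\n"
    "0\t95aa, >seqB... at 91.00%\n"
    "1\t120aa, >seqA... *\n"
    ">Cluster 1\n"
    "0\t80aa, >seqC... *\n"
)
print(sort_clusters_by_length(example), end="")
```

The sketch leaves the member index numbers untouched, so the output may need renumbering afterwards.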
PSI-CD-HIT: Function: PSI-CD-HIT groups proteins into clusters that meet a user-defined similarity threshold, which can be a sequence identity or an expect value. Each cluster has one representative sequence. The input is a protein dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.

Usage: psi-cd-hit.pl -i nr60 -o nr30 -c 0.3
CD-HIT-2D-PARA: Function: CD-HIT-2D-PARA is a script that runs cd-hit-2d or cd-hit-est-2d in parallel mode. It splits the input databases, runs cd-hit-2d or cd-hit-est-2d in parallel on a computer cluster, and finally merges the outputs into a single file. You can run it as you run cd-hit-2d or cd-hit-est-2d. The input is a protein or DNA/RNA dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.

Usage: cd-hit-2d-para.pl -i nr -i2 swissprot -o swissprot_vs_nr -c 0.6 -n 4 --Q 20 -T "SGE" --S 2 --S2 20
clstr_renumber.pl: Function: This script renumbers clusters and the sequences within clusters in a .clstr file after merging or other operations.

Usage: clstr_renumber.pl input.clstr > input_ren.clstr
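
As a rough illustration of what renumbering involves, the following Python sketch rewrites cluster headers and member indices as consecutive numbers starting at 0. It is not the Perl script; it assumes that each member line starts with an index followed by a tab, and the function name is hypothetical.

```python
def renumber_clstr(text):
    """Renumber clusters and their members consecutively, both from 0."""
    out = []
    cluster_no = -1
    member_no = 0
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            # New cluster: bump the cluster counter, restart member numbering.
            cluster_no += 1
            member_no = 0
            out.append(f">Cluster {cluster_no}")
        elif line.strip():
            # Assumed layout: "<index>\t<rest of line>"; keep the rest verbatim.
            _, _, rest = line.partition("\t")
            out.append(f"{member_no}\t{rest}")
            member_no += 1
    return "\n".join(out) + "\n"

# Hypothetical input with gaps in the numbering, as might remain after a merge.
merged = (
    ">Cluster 5\n"
    "3\t120aa, >a... *\n"
    "7\t95aa, >b... at 91.00%\n"
    ">Cluster 9\n"
    "0\t80aa, >c... *\n"
)
print(renumber_clstr(merged), end="")
```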
faFrag: Function: Extract a piece of DNA from a fasta file.

Usage: faFrag [options] in.fa start end out.fa

Supported input format: FASTA
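
The extraction step can be sketched in Python as follows. This is an illustrative stand-in, not faFrag's implementation: it assumes 0-based, end-exclusive coordinates, works on the first record only, and the function names and output header format are made up for the example.

```python
def read_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def extract_fragment(fasta_text, start, end):
    """Extract bases [start, end) from the first record (assumed convention)."""
    name, seq = read_fasta(fasta_text)[0]
    if not 0 <= start < end <= len(seq):
        raise ValueError(f"range {start}-{end} is outside 0-{len(seq)}")
    return f">{name}:{start}-{end}\n{seq[start:end]}\n"

# Hypothetical two-line record used only to exercise the function.
print(extract_fragment(">chrTest\nACGTAC\nGTTT\n", 2, 7), end="")
```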
CD-HIT-PARA: Function: CD-HIT-PARA is a script that runs cd-hit or cd-hit-est in parallel mode. It splits the input database, runs cd-hit or cd-hit-est in parallel on a computer cluster, and finally merges the outputs into a single file. You can run it as you run cd-hit or cd-hit-est. The input is a protein or DNA/RNA dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.

Usage: cd-hit-para.pl -i nr90 -o nr60 -c 0.6 -n 4 --B hosts --S 64
clstr2xml.pl: Function: This script converts a cluster file, or combines multiple cluster files from a hierarchical cd-hit run, into XML format. The output is sorted by sequence length (default) or cluster size. The input cluster files must be given in the order in which they were generated, that is, the cluster file with the higher identity cutoff comes first.

Usage: clstr2xml.pl [-len|-size] input1.clstr [input2.clstr input3.clstr ...]
CD-HIT: Function: CD-HIT groups proteins into clusters that meet a user-defined similarity threshold, usually a sequence identity. Each cluster has one representative sequence. The input is a protein dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.

Usage: cd-hit -i nr -o nr100 -c 1.00 -n 5 -M 2000
hgGcPercent: Function: Calculate the GC (guanine-cytosine) percentage in windows of a specified size across a genome sequence.

Usage: hgGcPercent [options] database nibDir

Supported input format: 2bit, nib
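
The windowed GC calculation can be illustrated with a short Python sketch. It is not how hgGcPercent is implemented: it works on a plain string rather than 2bit or nib input, uses non-overlapping windows, and counts only A, C, G and T in the denominator. Those choices, and the function name, are assumptions for this example.

```python
def gc_percent_windows(seq, window):
    """Return (start, end, gc_percent) for consecutive non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window].upper()
        # Count only unambiguous bases so that N does not dilute the percentage.
        known = sum(chunk.count(base) for base in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        pct = 100.0 * gc / known if known else 0.0
        results.append((start, start + len(chunk), pct))
    return results

# Hypothetical sequence; the last window is shorter than the others.
for start, end, pct in gc_percent_windows("GGCCAATTGC", 4):
    print(start, end, pct)
```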
bedtools nuc: Function: Profile the nucleotide content of intervals in a fasta file

Usage: bedtools nuc [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Supported input format: BED, GFF, GTF, VCF
ame: Function: Identify motifs that are enriched in your sequences compared to control sequences.

Usage: ame [options] <sequence_file> <motif_file>+
tomtom: Function: Tomtom compares one or more motifs against a database of known motifs (e.g., JASPAR). Tomtom ranks the motifs in the database and produces an alignment for each significant match.

Usage: tomtom [options] <query file> <target file>+