Sequence Analysis

CD-HIT
Function: CD-HIT groups protein sequences into clusters that meet a user-defined similarity threshold, usually a sequence identity. Each cluster has one representative sequence. The input is a protein dataset in FASTA format, and the output is two files: a FASTA file of representative sequences and a text file listing the clusters.
Usage: cd-hit -i db -o db90 -c 0.9 -n 5
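The cluster list that CD-HIT writes (the .clstr text file) can be read with a few lines of Python. The sketch below is illustrative only: it assumes the commonly seen layout in which each cluster starts with a ">Cluster N" line and each member line ends in "*" for the representative or "at XX%" for other members. The function name parse_clstr and the dictionary keys are hypothetical, not part of CD-HIT.

```python
import re

# Assumed member-line layout: "<index>\t<length>aa, ><name>... *"
# or "<index>\t<length>aa, ><name>... at 80.65%" ("nt" for nucleotide runs).
MEMBER_RE = re.compile(r"^(\d+)\s+(\d+)(aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(text):
    """Parse .clstr text into {cluster_number: [member dicts]}."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            continue  # skip lines that do not fit the assumed layout
        clusters[current].append({
            "name": match.group(4),
            "length": int(match.group(2)),
            "representative": match.group(5) == "*",
        })
    return clusters


# Made-up example data, not real CD-HIT output.
example = (
    ">Cluster 0\n"
    "0\t2799aa, >seqA... *\n"
    "1\t2214aa, >seqB... at 80.65%\n"
    ">Cluster 1\n"
    "0\t1500aa, >seqC... *\n"
)
clusters = parse_clstr(example)
for number, members in clusters.items():
    rep = next(m["name"] for m in members if m["representative"])
    print(number, len(members), rep)
```

Keeping the representative flag on each member makes it easy to cross-reference the cluster list with the FASTA file of representative sequences.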
faFrag
Function: Extract a piece of DNA from a fasta file.
Usage: faFrag [options] in.fa start end out.fa
Supported input format: FASTA
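To make the start and end arguments concrete, here is a rough Python equivalent of extracting a fragment from the first record of a FASTA file. It assumes zero-based, half-open coordinates (start included, end excluded); check the faFrag help text before relying on that convention. The function names and the header suffix are hypothetical.

```python
def read_first_record(fasta_text):
    """Return (header, sequence) for the first record in FASTA text."""
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                break  # stop at the second record
            header = line[1:]
        elif line and header is not None:
            chunks.append(line)
    return header, "".join(chunks)


def extract_fragment(fasta_text, start, end):
    """Return a FASTA-formatted fragment covering [start, end)."""
    header, seq = read_first_record(fasta_text)
    if header is None:
        raise ValueError("no FASTA record found")
    if not 0 <= start <= end <= len(seq):
        raise ValueError("requested range lies outside the sequence")
    # The ":start-end" suffix is an illustrative naming choice.
    return f">{header}:{start}-{end}\n{seq[start:end]}\n"


fragment = extract_fragment(">chrTest\nACGTAC\nGTAA\n", 2, 7)
print(fragment, end="")
```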
CD-HIT-PARA
Function: CD-HIT-PARA is a script that runs cd-hit or cd-hit-est in parallel mode. It splits the input database, runs cd-hit or cd-hit-est in parallel on a computer cluster, and finally merges the outputs into a single file. You run it the same way you run cd-hit or cd-hit-est. The input is a protein or DNA/RNA dataset in FASTA format, and the output is two files: a FASTA file of representative sequences and a text file listing the clusters.
Usage: cd-hit-para.pl -i nr90 -o nr60 -c 0.6 -n 4 --B hosts --S 64
CD-HIT-2D-PARA
Function: CD-HIT-2D-PARA is a script that runs cd-hit-2d or cd-hit-est-2d in parallel mode. It splits the input databases, runs cd-hit-2d or cd-hit-est-2d in parallel on a computer cluster, and finally merges the outputs into a single file. You run it the same way you run cd-hit-2d or cd-hit-est-2d. The input is a protein or DNA/RNA dataset in FASTA format, and the output is two files: a FASTA file of representative sequences and a text file listing the clusters.
Usage: cd-hit-2d-para.pl -i nr -i2 swissprot -o swissprot_vs_nr -c 0.6 -n 4 --Q 20 --T "SGE" --S 2 --S2 20
clstr_sort_prot_by.pl
Function: This script sorts the sequences within each cluster of a .clstr file by length, name, etc.
Usage: clstr_sort_prot_by.pl input.clstr id > input_sort.clstr
clstr2xml.pl
Function: This script converts a cluster file to XML format, or combines multiple cluster files from a hierarchical cd-hit run into a single XML output. The output is sorted by sequence length (default) or cluster size. The input cluster files must be given in the order in which they were generated, that is, the cluster file with the higher identity cutoff comes first.
Usage: clstr2xml.pl [-len|-size] input1.clstr [input2.clstr input3.clstr ...]
clstr_renumber.pl
Function: This script renumbers the clusters, and the sequences within each cluster, in a .clstr file after a merge or other operations.
Usage: clstr_renumber.pl input.clstr > input_ren.clstr
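The renumbering step can be pictured with a small Python sketch. It is not the Perl script itself: it assumes each cluster begins with a ">Cluster N" line and each member line starts with a numeric index followed by whitespace, and the function name renumber_clstr is hypothetical.

```python
def renumber_clstr(text):
    """Renumber clusters from 0 and members from 0 within each cluster."""
    out = []
    cluster_no = -1
    member_no = 0
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            cluster_no += 1
            member_no = 0
            out.append(f">Cluster {cluster_no}")
        elif line.strip():
            parts = line.split(None, 1)
            # Drop the old member index and keep the rest of the line.
            rest = parts[1] if len(parts) > 1 else ""
            out.append(f"{member_no}\t{rest}")
            member_no += 1
    return "\n".join(out) + "\n"


# Made-up input with gaps in the numbering, as might follow a merge.
merged = (
    ">Cluster 5\n"
    "3\t100aa, >a... *\n"
    "7\t90aa, >b... at 95.00%\n"
    ">Cluster 9\n"
    "0\t80aa, >c... *\n"
)
print(renumber_clstr(merged), end="")
```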
CD-HIT
Function: CD-HIT groups protein sequences into clusters that meet a user-defined similarity threshold, usually a sequence identity. Each cluster has one representative sequence. The input is a protein dataset in FASTA format, and the output is two files: a FASTA file of representative sequences and a text file listing the clusters.
Usage: cd-hit -i nr -o nr100 -c 1.00 -n 5 -M 2000
hgGcPercent
Function: Calculate the GC (guanine-cytosine) percentage in windows of a specified size across a genome sequence.
Usage: hgGcPercent [options] database nibDir
Supported input format: 2bit, nib
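The windowed calculation itself is simple to illustrate in Python. This sketch assumes non-overlapping windows and counts every character in a window toward its length. It does not reproduce how hgGcPercent treats N bases or how it formats its output, and the function name gc_percent_windows is hypothetical.

```python
def gc_percent_windows(seq, window):
    """Return (start, end, gc_percent) for consecutive windows of seq."""
    if window <= 0:
        raise ValueError("window size must be positive")
    seq = seq.upper()
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]  # the last window may be shorter
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, start + len(chunk), 100.0 * gc / len(chunk)))
    return results


for start, end, pct in gc_percent_windows("GGCCATATGC", 4):
    print(start, end, pct)
```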
bedtools nuc
Function: Profile the nucleotide content of intervals in a fasta file
Usage: bedtools nuc [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>
Supported input format: BED, GFF, GTF, VCF
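As a rough picture of what profiling an interval involves, the Python sketch below counts bases inside BED intervals, which use zero-based, half-open coordinates. It is a simplified stand-in, not bedtools itself: the function name nuc_profile and the output keys are hypothetical, and the sequences are passed in as an in-memory dictionary rather than read from a FASTA file.

```python
from collections import Counter


def nuc_profile(sequences, bed_lines):
    """Count bases in each BED interval.

    sequences: dict mapping chromosome name to sequence string.
    bed_lines: iterable of BED lines (chrom, start, end, ...).
    """
    rows = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        sub = sequences[chrom][start:end].upper()
        counts = Counter(sub)
        length = len(sub)
        gc = counts["G"] + counts["C"]
        at = counts["A"] + counts["T"]
        rows.append({
            "chrom": chrom, "start": start, "end": end,
            "frac_at": at / length if length else 0.0,
            "frac_gc": gc / length if length else 0.0,
            "A": counts["A"], "C": counts["C"],
            "G": counts["G"], "T": counts["T"],
        })
    return rows


# Made-up example data.
rows = nuc_profile({"chr1": "AACCGGTTNN"}, ["chr1\t2\t8"])
print(rows[0])
```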
ame
Function: Identify motifs that are enriched in your sequences compared to control sequences.
Usage: ame [options] <sequence_file> <motif_file>+
tomtom
Function: Tomtom compares one or more motifs against a database of known motifs (e.g., JASPAR). It ranks the motifs in the database and produces an alignment for each significant match.
Usage: tomtom [options] <query file> <target file>+