Sequence Analysis

samtools faidx
Function: Index a FASTA file, or extract the sequences of specified regions from an indexed FASTA file
Usage: samtools faidx [options] <ref.fasta> [region1 [...]]
Supported input format: FASTA
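
To illustrate the region syntax, here is a minimal Python sketch of what a query such as `chr1:3-8` selects, assuming 1-based, inclusive coordinates. The helper names `read_fasta` and `fetch_region` are hypothetical and are not part of samtools; the real tool reads from a `.fai` index instead of loading the whole file.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def fetch_region(records, region):
    """Return the sequence for 'name' or 'name:start-end' (1-based, inclusive)."""
    if ":" not in region:
        return records[region]
    name, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    # Convert the 1-based inclusive range to a 0-based half-open slice.
    return records[name][int(start) - 1:int(end)]


example = ">chr1 demo\nACGTAC\nGTTT\n>chr2\nGGGCCC\n"
records = read_fasta(example)
print(fetch_region(records, "chr1:3-8"))
```
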
faCount
Function: Count base statistics and CpGs in fasta files.
Usage: faCount file(s).fa
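
The kind of per-sequence statistics described here can be sketched in a few lines of Python. This is an illustration only: the function name `count_bases` and the returned field names are assumptions, and the sketch does not reproduce faCount's actual output layout.

```python
from collections import Counter


def count_bases(seq):
    """Count A, C, G, T, N and CpG dinucleotides in one sequence."""
    seq = seq.upper()  # assumption: treat lower-case (soft-masked) bases like upper-case
    counts = Counter(seq)
    # A CpG is a C immediately followed by a G on the same strand.
    cpg = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    return {
        "len": len(seq),
        "A": counts["A"],
        "C": counts["C"],
        "G": counts["G"],
        "T": counts["T"],
        "N": counts["N"],
        "cpg": cpg,
    }


print(count_bases("ACGCGNTT"))
```
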
faTrans
Function: Translate DNA sequences in a FASTA file to peptides
Usage: faTrans [options] in.fa out.fa
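
As a rough illustration of DNA-to-peptide translation, the sketch below translates a sequence in its first reading frame using the standard genetic code. The function name `translate` is hypothetical, and the choices made here (stop codons shown as `*`, unknown codons as `X`, trailing partial codon ignored) are assumptions, not faTrans behavior.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 0; '*' marks a stop, 'X' an unrecognized codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # drop a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


print(translate("ATGGCCTAA"))
```
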
MAFFT
Function: MAFFT is a multiple sequence alignment program for Unix-like operating systems. It offers a range of multiple alignment methods, including L-INS-i (accurate; for alignments of fewer than about 200 sequences) and FFT-NS-2 (fast; for alignments of fewer than about 30,000 sequences).
Usage: mafft [arguments] input > output
faSplit
Function: Split a fasta file into several files.
Usage: faSplit how input.fa count outRoot
FIMO
Function: FIMO scans a sequence database for individual matches to each of the motifs you provide.
Usage: fimo [options] <motifs> <database>
faFilter
Function: Filter fasta records, selecting ones that match the specified conditions
Usage: faFilter [options] in.fa out.fa
Supported input format: FASTA
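
The idea of selecting records by condition can be sketched in Python as follows. The helper names `iter_fasta` and `filter_records` and the parameters `min_size` and `name_pattern` are hypothetical illustrations and do not correspond one-to-one to faFilter's options.

```python
from fnmatch import fnmatchcase


def iter_fasta(text):
    """Yield (name, sequence) pairs from FASTA text."""
    name, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:].split()[0], []
        elif line and name is not None:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def filter_records(records, min_size=0, name_pattern="*"):
    """Keep records at least min_size long whose name matches a shell-style wildcard."""
    return [
        (name, seq)
        for name, seq in records
        if len(seq) >= min_size and fnmatchcase(name, name_pattern)
    ]


example = ">a1\nACGT\n>b2\nAC\nGTAC\n>a3\nA\n"
kept = filter_records(iter_fasta(example), min_size=4, name_pattern="a*")
for name, seq in kept:
    print(f">{name}\n{seq}")
```
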
plot_len.pl
Function: This script prints the distributions of clusters and sequences.
Usage: plot_len.pl input.clstr 1,2-4,5-9,10-19,20-49,50-99,100-299,500-99999 10-59,60-149,150-499,500-1999,2000-999999
CD-HIT-2D
Function: CD-HIT-2D compares two protein datasets (db1, db2). It identifies the sequences in db2 that are similar to sequences in db1 at a given threshold. The input is two protein datasets (db1, db2) in fasta format, and the output is two files: a fasta file of proteins in db2 that are not similar to db1, and a text file that lists similar sequences between db1 and db2.
Usage: cd-hit-2d -i db1 -i2 db2 -o db2novel -c 0.9 -n 5
CD-HIT-EST
Function: CD-HIT-EST clusters a nucleotide dataset into clusters that meet a user-defined similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in fasta format, and the output is two files: a fasta file of representative sequences and a text file listing the clusters. Because eukaryotic genes usually have long introns, which cause long gaps, it is difficult to make full-length alignments for these genes. CD-HIT-EST is therefore best suited to sequences that do not contain introns, such as ESTs.
Usage: cd-hit-est -i est_human -o est_human95 -c 0.95 -n 8
make_multi_seq.pl
Function: This script reads the .clstr file, generates a separate fasta file for each cluster above a certain size, and saves the files in a designated subdirectory. For the script to run correctly, the "-d 0" option should be used in the cd-hit run; using "-g 1" in the cd-hit run is also recommended to get accurate clustering results.
Usage: make_multi_seq.pl seq_db dbout.clstr multi-seq 20
faSize
Function: Print total base count and related statistics for sequences stored in FASTA files.
Usage: faSize file(s).fa
Supported input format: FASTA
faOneRecord
Function: Extract a single record from a fasta file
Usage: faOneRecord in.fa recordName
Supported input format: FASTA
CD-HIT-EST-2D
Function: CD-HIT-EST-2D compares two nucleotide datasets (db1, db2). It identifies the sequences in db2 that are similar to sequences in db1 at a given threshold. The input is two DNA/RNA datasets (db1, db2) in fasta format, and the output is two files: a fasta file of sequences in db2 that are not similar to db1, and a text file that lists similar sequences between db1 and db2. For the same reason as CD-HIT-EST, CD-HIT-EST-2D is best suited to sequences that do not contain introns, such as ESTs.
Usage: cd-hit-est-2d -i mrna_human -i2 est_human -o est_human_novel -c 0.95 -n 8
clstr_sort_prot_by.pl
Function: This script sorts the sequences within each cluster of a .clstr file by length, name, etc.
Usage: clstr_sort_prot_by.pl input.clstr id > input_sort.clstr
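
To make the within-cluster sort concrete, the sketch below parses a small .clstr excerpt and orders each cluster's members. It assumes the usual CD-HIT layout (a `>Cluster N` header followed by member lines such as `0<TAB>120aa, >seqA... *`); the function names `parse_clstr` and `sort_members` and the sample data are hypothetical, and the sort order chosen here (longest first for length, alphabetical for name) is an assumption, not the script's documented behavior.

```python
import re

# Member line: index, length with "aa" or "nt" suffix, ">name...", then "*" or "at NN%".
MEMBER = re.compile(r"^\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s*(.*)$")


def parse_clstr(text):
    """Return a list of clusters, each a list of member dicts."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">Cluster"):
            clusters.append([])
        elif line and clusters:
            match = MEMBER.match(line)
            if match:
                length, name, note = match.groups()
                clusters[-1].append({
                    "name": name,
                    "length": int(length),
                    "representative": note == "*",
                })
    return clusters


def sort_members(clusters, key="length"):
    """Sort members within each cluster: longest first for 'length', A-Z for 'name'."""
    reverse = key == "length"
    return [sorted(c, key=lambda m: m[key], reverse=reverse) for c in clusters]


example = (
    ">Cluster 0\n"
    "0\t120aa, >seqA... at 91.00%\n"
    "1\t300aa, >seqB... *\n"
    ">Cluster 1\n"
    "0\t80aa, >seqC... *\n"
)
for i, cluster in enumerate(sort_members(parse_clstr(example))):
    print(i, [(m["name"], m["length"]) for m in cluster])
```
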