Sequence Analysis

samtools faidx: Function: Index a FASTA file, or extract the sequences of specified regions from an indexed FASTA file

Usage: samtools faidx [options] <ref.fasta> [region1 [...]]

Supported input format: FASTA
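
To make the index-then-query idea concrete, here is a minimal Python sketch of how a FASTA index allows a region to be read without scanning the whole file. It mirrors the five columns of a .fai index (name, length, offset, bases per line, bytes per line), but the function names `build_index` and `fetch` are hypothetical, and the sketch assumes uniform line lengths within each record and sequence names that contain no colon. It illustrates the concept only and is not the samtools implementation.

```python
def build_index(data: bytes) -> dict:
    """Build a .fai-like index: per sequence, its length, the byte offset of
    its first base, the bases per line, and the bytes per line."""
    index = {}
    name = None
    pos = 0
    for raw in data.splitlines(keepends=True):
        if raw.startswith(b">"):
            name = raw[1:].split()[0].decode()
            index[name] = {"length": 0, "offset": pos + len(raw),
                           "line_bases": 0, "line_width": 0}
        elif name is not None and raw.strip():
            entry = index[name]
            if entry["line_bases"] == 0:
                # The first sequence line defines the record's line layout.
                entry["line_bases"] = len(raw.strip())
                entry["line_width"] = len(raw)
            entry["length"] += len(raw.strip())
        pos += len(raw)
    return index


def fetch(data: bytes, index: dict, region: str) -> str:
    """Return the bases for 'name' or 'name:start-end' (1-based, inclusive)."""
    name, _, span = region.partition(":")
    entry = index[name]
    start, end = 1, entry["length"]
    if span:
        start_text, _, end_text = span.partition("-")
        start = int(start_text)
        if end_text:
            end = min(int(end_text), entry["length"])

    def byte_pos(i: int) -> int:
        # Map a 0-based base index to its byte position in the file.
        full_lines, within = divmod(i, entry["line_bases"])
        return entry["offset"] + full_lines * entry["line_width"] + within

    chunk = data[byte_pos(start - 1): byte_pos(end - 1) + 1]
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

With the index in hand, a query such as `fetch(data, index, "chr1:7-10")` only touches the bytes that cover the requested bases, which is why indexed lookups stay fast on large references.
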
faCount: Function: Count base statistics and CpGs in fasta files.

Usage: faCount file(s).fa
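
As an illustration of the statistics described above, the following Python sketch counts bases and CpG dinucleotides per record. The helper names `read_fasta` and `base_stats` are hypothetical, and the output layout is an assumption; the real tool's columns may differ.

```python
from collections import Counter


def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def base_stats(seq):
    """Count A/C/G/T/N and CpG dinucleotides, ignoring letter case."""
    seq = seq.upper()
    counts = Counter(seq)
    stats = {base: counts.get(base, 0) for base in "ACGTN"}
    stats["len"] = len(seq)
    # "CG" cannot overlap itself, so str.count finds every occurrence.
    stats["cpg"] = seq.count("CG")
    return stats


if __name__ == "__main__":
    example = ">s1\nACGCGT\nNNcg\n"
    for record_name, record_seq in read_fasta(example):
        print(record_name, base_stats(record_seq))
```
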
faSplit: Function: Split a fasta file into several files.

Usage: faSplit how input.fa count outRoot
MAFFT: Function: MAFFT is a multiple sequence alignment program for Unix-like operating systems. It offers a range of multiple alignment methods, including L-INS-i (accurate; for alignment of <~200 sequences) and FFT-NS-2 (fast; for alignment of <~30,000 sequences).

Usage: mafft [arguments] input > output
faTrans: Function: Translate DNA sequences in a FASTA file to peptides

Usage: faTrans [options] in.fa out.fa
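
The translation step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's code: it uses the standard genetic code, reads frame 1 only, writes stop codons as `*` and unrecognized codons as `X`, and drops a trailing partial codon. The function name `translate` and these symbol choices are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, stop_at_first=False):
    """Translate DNA to a peptide string, reading from the first base."""
    dna = dna.upper().replace("U", "T")
    peptide = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*" and stop_at_first:
            break
        peptide.append(aa)
    return "".join(peptide)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```
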
FIMO: Function: FIMO scans a sequence database for individual matches to each of the motifs you provide.

Usage: fimo [options] <motifs> <database>
faFilter: Function: Filter fasta records, selecting ones that match the specified conditions

Usage: faFilter [options] in.fa out.fa

Supported input format: FASTA
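
A small Python sketch of record filtering follows. The two conditions used here, a minimum sequence length and a shell-style name pattern, are chosen for illustration, and `filter_records` with its parameters is a hypothetical helper; consult the tool's own help text for the conditions it actually accepts.

```python
from fnmatch import fnmatchcase


def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_records(text, min_size=0, name_pattern="*"):
    """Keep records that are long enough and whose name matches the pattern."""
    return [(name, seq) for name, seq in read_fasta(text)
            if len(seq) >= min_size and fnmatchcase(name, name_pattern)]


if __name__ == "__main__":
    example = ">geneA\nACGTACGT\n>geneB\nACG\n>other\nACGTACGTAC\n"
    for name, seq in filter_records(example, min_size=5, name_pattern="gene*"):
        print(f">{name}\n{seq}")
```
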
CD-HIT-EST: Function: CD-HIT-EST groups a nucleotide dataset into clusters that meet a user-defined similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in FASTA format, and the output is two files: a FASTA file of representative sequences and a text file listing the clusters. Because eukaryotic genes usually have long introns, which cause long gaps, full-length alignments of these genes are difficult to make. CD-HIT-EST is therefore best suited to sequences without introns, such as ESTs.

Usage: cd-hit-est -i est_human -o est_human95 -c 0.95 -n 8
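
The idea of clustering against a threshold, with one representative per cluster, can be sketched as a greedy pass over the sequences from longest to shortest. The Python below is a toy illustration only: `identity` is a crude stand-in built on `difflib` (matched characters divided by the length of the shorter sequence), not the alignment or short-word filtering that CD-HIT-EST performs, and both function names are hypothetical. The default threshold of 0.95 echoes the `-c 0.95` in the usage line.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude similarity: matched characters / length of the shorter sequence."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / shorter


def greedy_cluster(seqs, threshold=0.95):
    """Cluster {name: sequence}; the first name in each cluster is its representative."""
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    clusters = []
    for name in order:
        for cluster in clusters:
            representative = seqs[cluster[0]]
            if identity(representative, seqs[name]) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    demo = {
        "a": "ACGTACGTACGTACGTACGT",
        "b": "ACGTACGTACGTACGTACG",
        "c": "TTTTTTTTTTGGGGGGGG",
    }
    for members in greedy_cluster(demo):
        print(members[0], "represents", members)
```
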
CD-HIT-2D: Function: CD-HIT-2D compares two protein datasets (db1, db2). It identifies the sequences in db2 that are similar to db1 at a given threshold. The inputs are two protein datasets (db1, db2) in FASTA format, and the output is two files: a FASTA file of the proteins in db2 that are not similar to db1, and a text file that lists the similar sequences between db1 and db2.

Usage: cd-hit-2d -i db1 -i2 db2 -o db2novel -c 0.9 -n 5
faSize: Function: Print total base count and related statistics for sequences stored in FASTA files.

Usage: faSize file(s).fa

Supported input format: FASTA
plot_len.pl: Function: This is a script to print out distributions of clusters & sequences.

Usage: plot_len.pl input.clstr 1,2-4,5-9,10-19,20-49,50-99,100-299,500-99999 10-59,60-149,150-499,500-1999,2000-999999
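
For readers who want to inspect a .clstr file directly, here is a minimal Python sketch that parses the cluster listing and tallies clusters by size and sequences by length. It assumes the usual .clstr layout (a `>Cluster N` header followed by member lines such as `0	120nt, >seqA... *`) and assumes that the two comma-separated arguments in the usage line are cluster-size bins and sequence-length bins. The function names are hypothetical, and the script's real output format is not reproduced.

```python
import re

# Member line: index, length with an aa/nt suffix, then ">name..."
MEMBER = re.compile(r"^\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.")


def parse_clstr(text):
    """Return a list of clusters, each a list of (name, length) tuples."""
    clusters = []
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            clusters.append([])
            continue
        match = MEMBER.match(line)
        if match and clusters:
            clusters[-1].append((match.group(2), int(match.group(1))))
    return clusters


def histogram(values, spec):
    """Count values per bin for a spec such as '1,2-4,5-9' (inclusive bounds)."""
    counts = {}
    for part in spec.split(","):
        low, _, high = part.partition("-")
        lo, hi = int(low), int(high or low)
        counts[part] = sum(lo <= value <= hi for value in values)
    return counts


if __name__ == "__main__":
    sample = (">Cluster 0\n"
              "0\t120nt, >seqA... *\n"
              "1\t118nt, >seqB... at +/98.31%\n"
              ">Cluster 1\n"
              "0\t45nt, >seqD... *\n")
    parsed = parse_clstr(sample)
    print(histogram([len(c) for c in parsed], "1,2-4,5-9"))
    print(histogram([n for c in parsed for _, n in c], "10-59,60-149"))
```
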
make_multi_seq.pl: Function: This script reads the .clstr file, generates a separate FASTA file for each cluster above a given size, and saves these files in a designated subdirectory. For the script to run correctly, the "-d 0" option should be used in the cd-hit run, and using "-g 1" as well is recommended for accurate clustering results.

Usage: make_multi_seq.pl seq_db dbout.clstr multi-seq 20
CD-HIT-EST-2D: Function: CD-HIT-EST-2D compares two nucleotide datasets (db1, db2). It identifies the sequences in db2 that are similar to db1 at a given threshold. The inputs are two DNA/RNA datasets (db1, db2) in FASTA format, and the output is two files: a FASTA file of the sequences in db2 that are not similar to db1, and a text file that lists the similar sequences between db1 and db2. For the same reason as CD-HIT-EST, CD-HIT-EST-2D is best suited to sequences without introns, such as ESTs.

Usage: cd-hit-est-2d -i mrna_human -i2 est_human -o est_human_novel -c 0.95 -n 8
faOneRecord: Function: Extract a single record from a fasta file

Usage: faOneRecord in.fa recordName

Supported input format: FASTA
PSI-CD-HIT: Function: PSI-CD-HIT groups proteins into clusters that meet a user-defined similarity threshold, which can be an identity or an expect value. Each cluster has one representative sequence. The input is a protein dataset in FASTA format, and the output is two files: a FASTA file of representative sequences and a text file listing the clusters.

Usage: psi-cd-hit.pl -i nr60 -o nr30 -c 0.3