CD-HIT-EST

Reference

CD-HIT-EST clusters a nucleotide dataset into groups that meet a user-defined similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in FASTA format, and the output is two files: a FASTA file of representative sequences and a text file listing the clusters. Because eukaryotic genes usually contain long introns, which produce long gaps in alignments, it is difficult to make full-length alignments for these genes. CD-HIT-EST is therefore best suited to sequences without introns, such as ESTs.

Usage

cd-hit-est -i est_human -o est_human95 -c 0.95 -n 8
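
When the command is launched from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch, not part of CD-HIT itself: the function name build_cd_hit_est_command is hypothetical, and only the -i, -o, -c and -n flags shown in the usage line above are used.

```python
from typing import List


def build_cd_hit_est_command(
    input_fasta: str, output_prefix: str, identity: float, word_size: int
) -> List[str]:
    """Return an argv list mirroring the usage line above (hypothetical helper)."""
    return [
        "cd-hit-est",
        "-i", input_fasta,     # input FASTA file
        "-o", output_prefix,   # output file name
        "-c", str(identity),   # sequence identity threshold
        "-n", str(word_size),  # word size
    ]


if __name__ == "__main__":
    cmd = build_cd_hit_est_command("est_human", "est_human95", 0.95, 8)
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True), assuming cd-hit-est is installed and on the PATH.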

Manual

Choice of word size:

-n 8,9,10 for thresholds 0.90 ~ 1.0
-n 7      for thresholds 0.88 ~ 0.9
-n 6      for thresholds 0.85 ~ 0.88
-n 5      for thresholds 0.80 ~ 0.85
-n 4      for thresholds 0.75 ~ 0.8
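
The table above can be encoded as a small lookup. This is only a sketch under two assumptions that are not stated in the manual: a threshold sitting exactly on a shared boundary (for example 0.90) is assigned to the higher range, and for the 0.90 ~ 1.0 range the smallest listed word size (8) is returned, although 9 and 10 are equally allowed by the table. The function name suggest_word_size is hypothetical.

```python
def suggest_word_size(threshold: float) -> int:
    """Suggest a -n value for a given -c threshold, following the table above."""
    # (lower bound of the range, word size), highest range first.
    # Assumption: boundary values belong to the higher range.
    table = [(0.90, 8), (0.88, 7), (0.85, 6), (0.80, 5), (0.75, 4)]
    if not 0.75 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.75 and 1.0")
    for lower_bound, word_size in table:
        if threshold >= lower_bound:
            return word_size
    raise ValueError("threshold must be between 0.75 and 1.0")


if __name__ == "__main__":
    for c in (0.95, 0.89, 0.86, 0.82, 0.76):
        print(c, suggest_word_size(c))
```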

More options:

The options -b, -M, -l, -d, -t, -s, -S, -B, -p, -aL, -AL, -aS, -AS, -g, -G, and -T are the same as in CD-HIT. Options specific to cd-hit-est:

-r 1 or 0, default 0; if set to 1, compare both strands (++, +-)
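
To illustrate what comparing both strands means: a +- comparison matches one sequence against the reverse complement of the other. The sketch below is a toy illustration only and is not how CD-HIT-EST is implemented; exact string equality stands in for the real identity threshold, and the function names are hypothetical.

```python
# Translation table for DNA bases (N is kept as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def matches_on_either_strand(a: str, b: str) -> bool:
    """Toy check: True if b equals a directly (++) or as a reverse complement (+-)."""
    return a == b or a == reverse_complement(b)


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(matches_on_either_strand("ATGC", "GCAT"))
```

With -r 0 only the ++ case would be considered, so two reads taken from opposite strands of the same transcript would not be compared against each other in the +- orientation.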
