Software Parameter Function
plot_len.pl plot_len.pl input.clstr 1,2-4,5-9,10-19,20-49,50-99,100-299,500-99999 10-59,60-149,150-499,500-1999,2000-999999 This script prints the distributions of clusters and sequences. The first list of ranges gives the cluster-size bins and the second gives the sequence-length bins.
make_multi_seq.pl make_multi_seq.pl seq_db dbout.clstr multi-seq 20 This script reads the .clstr file and writes a separate fasta file for each cluster over a certain size into the designated subdirectory. For the script to run correctly, the cd-hit run should use the "-d 0" option; using "-g 1" as well is recommended to get accurate clustering results.
FIMO fimo [options] <motifs> <database> FIMO scans a sequence database for individual matches to each of the motifs you provide.
MAFFT mafft [arguments] input > output MAFFT is a multiple sequence alignment program for Unix-like operating systems. It offers a range of multiple alignment methods, such as L-INS-i (accurate; for alignment of <∼200 sequences) and FFT-NS-2 (fast; for alignment of <∼30,000 sequences).
CD-HIT-2D cd-hit-2d -i db1 -i2 db2 -o db2novel -c 0.9 -n 5 CD-HIT-2D compares two protein datasets (db1, db2). It identifies the sequences in db2 that are similar to db1 at a given threshold. The inputs are two protein datasets (db1, db2) in fasta format and the outputs are two files: a fasta file of proteins in db2 that are not similar to db1 and a text file that lists similar sequences between db1 and db2.
clstr_sort_prot_by.pl clstr_sort_prot_by.pl input.clstr id > input_sort.clstr This script sorts the sequences within each cluster of a .clstr file by length, name, etc.
CD-HIT-2D-PARA cd-hit-2d-para.pl -i nr -i2 swissprot -o swissprot_vs_nr -c 0.6 -n 4 --Q 20 -T "SGE" --S 2 --S2 20 CD-HIT-2D-PARA is a script that runs cd-hit-2d or cd-hit-est-2d in parallel mode. It splits the input databases, runs cd-hit-2d or cd-hit-est-2d in parallel on a computer cluster, and finally merges the outputs into a single file. You can run it as you run cd-hit-2d or cd-hit-est-2d. The input is a protein or DNA/RNA dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.
clstr_renumber.pl clstr_renumber.pl input.clstr > input_ren.clstr This script renumbers the clusters, and the sequences within each cluster, in a .clstr file after merging or other operations.
clstr2xml.pl clstr2xml.pl [-len|-size] input1.clstr [input2.clstr input3.clstr ...] This script converts a cluster file, or combines multiple cluster files from a hierarchical cd-hit run, into XML format. The output is sorted by sequence length (default) or cluster size. The input cluster files must be given in the order in which they were generated, that is, the cluster file with the higher identity cutoff comes first.
CD-HIT-PARA cd-hit-para.pl -i nr90 -o nr60 -c 0.6 -n 4 --B hosts --S 64 CD-HIT-PARA is a script that runs cd-hit or cd-hit-est in parallel mode. It splits the input database, runs cd-hit or cd-hit-est in parallel on a computer cluster, and finally merges the outputs into a single file. You can run it as you run cd-hit or cd-hit-est. The input is a protein or DNA/RNA dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.
CD-HIT-EST-2D cd-hit-est-2d -i mrna_human -i2 est_human -o est_human_novel -c 0.95 -n 8 CD-HIT-EST-2D compares two nucleotide datasets (db1, db2). It identifies the sequences in db2 that are similar to db1 at a given threshold. The inputs are two DNA/RNA datasets (db1, db2) in fasta format and the outputs are two files: a fasta file of sequences in db2 that are not similar to db1 and a text file that lists similar sequences between db1 and db2. For the same reason as CD-HIT-EST, CD-HIT-EST-2D is best suited to sequences without introns, such as ESTs.
CD-HIT cd-hit -i nr -o nr100 -c 1.00 -n 5 -M 2000 CD-HIT groups proteins into clusters that meet a user-defined similarity threshold, usually a sequence identity. Each cluster has one representative sequence. The input is a protein dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.
CD-HIT cd-hit -i db -o db90 -c 0.9 -n 5 CD-HIT groups proteins into clusters that meet a user-defined similarity threshold, usually a sequence identity. Each cluster has one representative sequence. The input is a protein dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.
PSI-CD-HIT psi-cd-hit.pl -i nr60 -o nr30 -c 0.3 PSI-CD-HIT groups proteins into clusters that meet a user-defined similarity threshold, which can be an identity or an expect value. Each cluster has one representative sequence. The input is a protein dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.
CD-HIT-EST cd-hit-est -i est_human -o est_human95 -c 0.95 -n 8 CD-HIT-EST groups a nucleotide dataset into clusters that meet a user-defined similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters. Because eukaryotic genes usually have long introns, which cause long gaps, it is difficult to make full-length alignments for these genes. CD-HIT-EST is therefore best suited to sequences without introns, such as ESTs.
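Several of the scripts listed above read the .clstr file that the cd-hit programs write. As a rough illustration of that kind of post-processing, the Python sketch below parses .clstr text and counts clusters per size bin, in the spirit of plot_len.pl. It is not the Perl script's implementation. The line layout it assumes (a ">Cluster N" header followed by member lines such as "0<TAB>2799aa, >name... *", with "*" marking the representative) is an assumption about the file format, and the names parse_clstr, size_histogram and SAMPLE are hypothetical.

```python
import re
from collections import Counter

# Assumed member-line layout: "<index>\t<length>aa, ><name>... *" for the
# representative, or "... at <identity>%" for other members; nucleotide
# files are assumed to use "nt" instead of "aa".
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(lines):
    """Return a list of clusters; each cluster is a list of
    (name, length, is_representative) tuples."""
    clusters = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append([])  # start a new cluster
            continue
        match = MEMBER_RE.match(line)
        if match is None or not clusters:
            raise ValueError(f"unrecognized .clstr line: {line!r}")
        length, name, tail = match.groups()
        clusters[-1].append((name, int(length), tail == "*"))
    return clusters


def size_histogram(clusters, bins):
    """Count clusters per size bin; bins are inclusive (low, high) pairs."""
    counts = Counter()
    for members in clusters:
        size = len(members)
        for low, high in bins:
            if low <= size <= high:
                counts[(low, high)] += 1
                break
    return counts


# Hypothetical two-cluster example used only for illustration.
SAMPLE = """\
>Cluster 0
0\t2799aa, >seqA... *
>Cluster 1
0\t2214aa, >seqB... at 80.60%
1\t2215aa, >seqC... at 84.24%
2\t2217aa, >seqD... *
""".splitlines()

if __name__ == "__main__":
    hist = size_histogram(parse_clstr(SAMPLE), [(1, 1), (2, 4), (5, 9)])
    for (low, high), count in sorted(hist.items()):
        print(f"{low}-{high}\t{count}")
```

To use it on a real file, pass an open file handle to parse_clstr in place of SAMPLE. The bins mirror the comma-separated ranges given to plot_len.pl, written here as inclusive pairs.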