Sequence Analysis

clstr_sort_prot_by.pl
Function: This script sorts the sequences within each cluster of a .clstr file by length, name, or another field.
Usage: clstr_sort_prot_by.pl input.clstr id > input_sort.clstr
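To illustrate what sorting within clusters means, here is a minimal Python sketch that reorders the member lines of each cluster by decreasing length. It is not the Perl script itself. It assumes the usual .clstr layout (a ">Cluster N" header followed by tab-separated member lines such as "0<TAB>300aa, >seqB... *"), and the function names parse_clstr, member_length and sort_members_by_length are hypothetical.

```python
import re


def parse_clstr(text):
    """Parse .clstr text into a list of (header, member_lines) tuples."""
    clusters = []
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            clusters.append((line, []))
        elif line.strip() and clusters:
            clusters[-1][1].append(line)
    return clusters


def member_length(line):
    """Return the length field of a member line (e.g. 300 from '300aa,')."""
    match = re.search(r"(\d+)(?:aa|nt),", line)
    return int(match.group(1)) if match else 0


def sort_members_by_length(text):
    """Sort members inside every cluster by decreasing length.

    Member index numbers are left as they are in this sketch.
    """
    out = []
    for header, members in parse_clstr(text):
        out.append(header)
        out.extend(sorted(members, key=member_length, reverse=True))
    return "\n".join(out) + "\n"


# Made-up example data, for illustration only.
example = (
    ">Cluster 0\n"
    "0\t120aa, >seqA... at 85.00%\n"
    "1\t300aa, >seqB... *\n"
    ">Cluster 1\n"
    "0\t90aa, >seqC... *\n"
)
sorted_text = sort_members_by_length(example)
print(sorted_text, end="")
```

Sorting by another field, such as the sequence name, would only require a different key function.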
CD-HIT-2D-PARA
Function: CD-HIT-2D-PARA is a script that runs cd-hit-2d or cd-hit-est-2d in parallel mode. It splits the input databases, runs cd-hit-2d or cd-hit-est-2d in parallel on a computer cluster, and finally merges the outputs into a single file. You can run it as you would run cd-hit-2d or cd-hit-est-2d. The input is a protein or DNA/RNA dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.
Usage: cd-hit-2d-para.pl -i nr -i2 swissprot -o swissprot_vs_nr -c 0.6 -n 4 --Q 20 -T "SGE" --S 2 --S2 20
clstr_renumber.pl
Function: This script renumbers the clusters, and the sequences within each cluster, in a .clstr file after merging or other operations.
Usage: clstr_renumber.pl input.clstr > input_ren.clstr
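The renumbering step can be sketched in a few lines of Python. This is an illustration under assumptions, not the Perl script: it assumes that cluster headers start with ">Cluster" and that each member line begins with an index followed by a tab. The function name renumber_clstr is hypothetical.

```python
def renumber_clstr(text):
    """Renumber clusters from 0 and members from 0 within each cluster."""
    out = []
    cluster_no = -1
    member_no = 0
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            cluster_no += 1
            member_no = 0
            out.append(f">Cluster {cluster_no}")
        elif line.strip():
            # Assumed member layout: "<index>\t<rest>"; keep <rest> untouched.
            _, _, rest = line.partition("\t")
            out.append(f"{member_no}\t{rest}")
            member_no += 1
    return "\n".join(out) + "\n"


# Made-up example of a file whose numbering is out of order after a merge.
merged = (
    ">Cluster 7\n"
    "3\t300aa, >seqB... *\n"
    "0\t120aa, >seqA... at 85.00%\n"
    ">Cluster 2\n"
    "5\t90aa, >seqC... *\n"
)
renumbered = renumber_clstr(merged)
print(renumbered, end="")
```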
PSI-CD-HIT
Function: PSI-CD-HIT groups proteins into clusters that meet a user-defined similarity threshold, which can be a sequence identity or an expect value (E-value). Each cluster has one representative sequence. The input is a protein dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.
Usage: psi-cd-hit.pl -i nr60 -o nr30 -c 0.3
CD-HIT-PARA
Function: CD-HIT-PARA is a script that runs cd-hit or cd-hit-est in parallel mode. It splits the input database, runs cd-hit or cd-hit-est in parallel on a computer cluster, and finally merges the outputs into a single file. You can run it as you would run cd-hit or cd-hit-est. The input is a protein or DNA/RNA dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.
Usage: cd-hit-para.pl -i nr90 -o nr60 -c 0.6 -n 4 --B hosts --S 64
faFrag
Function: Extract a piece of DNA from a fasta file.
Usage: faFrag [options] in.fa start end out.fa
Supported input format: FASTA
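As a rough illustration of fragment extraction, the Python sketch below pulls a sub-sequence out of the first record of a FASTA text. It assumes zero-based, half-open coordinates (start included, end excluded); check the faFrag help text for the convention the tool actually uses. The function names and the output header format are hypothetical.

```python
def read_first_fasta_record(text):
    """Return (name, sequence) for the first record in FASTA text."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                break  # a second record starts here; stop reading
            name = line[1:].strip().split(" ")[0]
        elif name is not None:
            chunks.append(line.strip())
    return name, "".join(chunks)


def extract_fragment(text, start, end, width=50):
    """Extract seq[start:end] (assumed 0-based, half-open) as FASTA text."""
    name, seq = read_first_fasta_record(text)
    if name is None:
        raise ValueError("no FASTA record found")
    if not 0 <= start < end <= len(seq):
        raise ValueError(f"range {start}-{end} outside sequence of length {len(seq)}")
    frag = seq[start:end]
    lines = [f">{name}:{start}-{end}"]
    lines += [frag[i:i + width] for i in range(0, len(frag), width)]
    return "\n".join(lines) + "\n"


# Made-up two-line record, for illustration only.
fasta = ">chrTest demo\nACGTACGTAC\nGGGGCCCCAA\n"
fragment = extract_fragment(fasta, 8, 14)
print(fragment, end="")
```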
clstr2xml.pl
Function: This script converts a cluster file, or combines multiple cluster files from a hierarchical cd-hit run, into XML format. The output is sorted by sequence length (default) or by cluster size. The input cluster files must be given in the order in which they were generated, that is, the cluster file with the higher identity cutoff comes first.
Usage: clstr2xml.pl [-len|-size] input1.clstr [input2.clstr input3.clstr ...]
CD-HIT
Function: CD-HIT groups proteins into clusters that meet a user-defined similarity threshold, usually a sequence identity. Each cluster has one representative sequence. The input is a protein dataset in fasta format and the outputs are two files: a fasta file of representative sequences and a text file listing the clusters.
Usage: cd-hit -i nr -o nr100 -c 1.00 -n 5 -M 2000
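The idea of one representative per cluster can be illustrated with a toy greedy clustering in Python. This is a sketch under stated assumptions, not CD-HIT's implementation: sequences are processed from longest to shortest, and each one joins the first cluster whose representative it matches at or above the threshold. The identity measure used here (toy_identity, a position-by-position comparison without alignment) is a deliberately simplistic stand-in, and all names are hypothetical.

```python
def toy_identity(a, b):
    """Fraction of matching positions over the shorter sequence.

    Ungapped and left-aligned: a toy stand-in for a real alignment identity.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / shorter


def greedy_cluster(seqs, threshold):
    """Cluster a dict of name -> sequence.

    Returns a list of clusters; the first name in each is its representative.
    """
    # Longest first, so each representative is the longest cluster member.
    order = sorted(seqs, key=lambda n: len(seqs[n]), reverse=True)
    clusters = []
    for name in order:
        for cluster in clusters:
            representative = cluster[0]
            if toy_identity(seqs[representative], seqs[name]) >= threshold:
                cluster.append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append([name])
    return clusters


# Made-up sequences, for illustration only.
proteins = {
    "p1": "MKTAYIAKQRQISFVK",
    "p2": "MKTAYIAKQRQISF",
    "p3": "GSHMLEDPVDAFKQ",
}
clusters = greedy_cluster(proteins, 0.9)
print(clusters)
```

Here p2 is a prefix of p1 and joins its cluster, while p3 shares no positions with p1 and becomes its own representative.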
hgGcPercent
Function: Calculate the GC (guanine-cytosine) percentage in windows of a specified size across a genome sequence.
Usage: hgGcPercent [options] database nibDir
Supported input format: 2bit, nib
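The windowed GC calculation can be sketched in Python as follows. This is only an illustration: it works on a plain string rather than 2bit or nib files, uses consecutive non-overlapping windows, and counts every character (including N) in the denominator, all of which are assumptions. The function name gc_percent_windows is hypothetical.

```python
def gc_percent_windows(seq, window):
    """Return (start, end, gc_percent) for consecutive non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    seq = seq.upper()
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        # The last window may be shorter than `window`.
        results.append((start, start + len(chunk), 100.0 * gc / len(chunk)))
    return results


# Made-up 14-base sequence, for illustration only.
windows = gc_percent_windows("GGCCATATGCGCAT", 4)
for start, end, pct in windows:
    print(start, end, pct)
```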
bedtools nuc
Function: Profile the nucleotide content of intervals in a fasta file.
Usage: bedtools nuc [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>
Supported input format: BED, GFF, GTF, VCF
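To show what profiling the nucleotide content of intervals involves, here is a minimal Python sketch that counts bases inside BED intervals. It is not bedtools itself: the genome is an in-memory dict rather than a fasta file, only the first three BED columns are read, and the function name nuc_profile and the output keys are hypothetical. BED coordinates are zero-based and half-open, which maps directly onto Python slicing.

```python
from collections import Counter


def nuc_profile(sequences, bed_lines):
    """Count A/C/G/T and the GC fraction for each BED interval.

    sequences: dict of chromosome name -> sequence string.
    bed_lines: iterable of BED lines (chrom, start, end, ...).
    """
    rows = []
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        sub = sequences[chrom][start:end].upper()
        counts = Counter(sub)
        length = len(sub)
        gc = counts["G"] + counts["C"]
        rows.append({
            "chrom": chrom,
            "start": start,
            "end": end,
            "A": counts["A"],
            "C": counts["C"],
            "G": counts["G"],
            "T": counts["T"],
            "gc_fraction": gc / length if length else 0.0,
        })
    return rows


# Made-up genome and intervals, for illustration only.
genome = {"chr1": "ACGTGGCCAATT"}
profile = nuc_profile(genome, ["chr1\t0\t4", "chr1\t4\t8"])
for row in profile:
    print(row)
```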
ame
Function: Identify motifs that are enriched in your sequences compared to control sequences.
Usage: ame [options] <sequence_file> <motif_file>+
tomtom
Function: Tomtom compares one or more motifs against a database of known motifs (e.g., JASPAR). It ranks the motifs in the database and produces an alignment for each significant match.
Usage: tomtom [options] <query file> <target file>+