Reads Manipulation

Function: cd-hit-dup is a simple tool for removing duplicates from sequencing reads, with an optional step to detect and remove chimeric reads.
Usage: cd-hit-dup -i input.fa -o output.fa [other options]
Function: This tool corrects the GC-bias using the method proposed by [Benjamini & Speed (2012). Nucleic Acids Research, 40(10)]. It will remove reads from regions with too high coverage compared to the expected values (typically GC-rich regions) and will add reads to regions where too few reads are seen (typically AT-rich regions). The tool computeGCBias needs to be run first to generate the frequency table needed here.
Usage: correctGCBias -b file.bam --effectiveGenomeSize 2150570000 -g mm9.2bit --GCbiasFrequenciesFile freq.txt -o gc_corrected.bam [options]
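Because correctGCBias depends on the frequency table written by computeGCBias, the two commands run in sequence. The sketch below prints both command lines in order (a dry run) and only executes them when RUN=1 is set. The `run` helper, the `gc_correct` function, the RUN variable, and the `-l 200` fragment length are illustrative assumptions; the file names are the placeholders from the usage line above.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Hypothetical wrapper: args are BAM, 2bit genome, effective genome size,
# frequency table, and output BAM.
gc_correct() {
    bam=$1; genome=$2; size=$3; freq=$4; out=$5
    # Step 1: estimate the GC bias and write the frequency table.
    # The fragment length (-l 200) is an assumed example value.
    run computeGCBias -b "$bam" --effectiveGenomeSize "$size" -g "$genome" \
        -l 200 --GCbiasFrequenciesFile "$freq" || return 1
    # Step 2: use that table to write the GC-corrected BAM.
    run correctGCBias -b "$bam" --effectiveGenomeSize "$size" -g "$genome" \
        --GCbiasFrequenciesFile "$freq" -o "$out"
}

gc_correct file.bam mm9.2bit 2150570000 freq.txt gc_corrected.bam
```

Keeping the default as a dry run makes it easy to check the argument order before committing to a long computation.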
Function: CAP3 (Contig Assembly Program) is a DNA sequence assembly program for small-scale assembly with or without quality values.
Usage: cap3 input_reads.fasta [options] > output.txt
Function: Calculate the distributions of inserted nucleotides across reads.
Usage: insertion_profile.py -s "PE" -i test.bam -o out
maq fasta2bfa
Function: Convert sequences in FASTA format to Maq’s BFA (binary FASTA) format.
Usage: maq fasta2bfa in.ref.fasta out.ref.bfa
Function: cd-hit-dup is a simple tool for removing duplicates from sequencing reads, with an optional step to detect and remove chimeric reads. A number of options are provided to tune how the duplicates are removed.
Usage: cd-hit-dup -i input.fa -o output
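To make the idea of duplicate removal concrete, here is a minimal sketch that keeps the first record for each distinct sequence in a FASTA file. It only handles exact, full-length duplicates and is not the algorithm cd-hit-dup uses; the function name `dedup_fasta` and the sample records are hypothetical.

```shell
#!/bin/sh
# Keep the first FASTA record for every distinct sequence (exact matches only).
dedup_fasta() {
    awk '
        # Print the buffered record if its sequence has not been seen yet.
        function emit() {
            if (hdr != "" && !(seq in seen)) {
                seen[seq] = 1
                print hdr
                print seq
            }
        }
        /^>/ { emit(); hdr = $0; seq = ""; next }   # new header: flush previous record
        { seq = seq $0 }                            # join wrapped sequence lines
        END { emit() }
    ' "$@"
}

# Hypothetical sample: read_b duplicates read_a and is dropped.
printf '>read_a\nACGT\n>read_b\nACGT\n>read_c\nTTTT\n' | dedup_fasta
```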
Function: This tool samples the given BAM files with paired-end data to estimate the fragment length distribution. Properly paired reads are preferred for computation, i.e., unless a region does not contain any concordant pairs, discordant pairs are ignored.
Usage: bamPEFragmentSize [-h] [--bamfiles bam files [bam files ...]] [--histogram FILE] [--plotFileFormat FILETYPE] [--numberOfProcessors INT] [--samplesLabel SAMPLESLABEL [SAMPLESLABEL ...]] [--plotTitle PLOTTITLE] [--maxFragmentLength MAXFRAGMENTLENGTH] [--logScale] [--binSize INT] [--distanceBetweenBins INT] [--blackListFileName BED file] [--table FILE] [--outRawFragmentLengths FILE] [--verbose] [--version]
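The quantity being estimated can be illustrated on SAM text (for example, the output of `samtools view`). The sketch below averages the template length (column 9) over properly paired reads, counting each pair once by taking only the mate with a positive length. It is a simplified illustration rather than the sampling procedure bamPEFragmentSize uses; `frag_mean` and the sample records are hypothetical.

```shell
#!/bin/sh
# Mean fragment length from SAM records read on stdin or from files.
frag_mean() {
    awk -F'\t' '
        /^@/ { next }                        # skip SAM header lines
        int($2 / 2) % 2 == 1 && $9 > 0 {     # flag bit 0x2 set, leftmost mate only
            sum += $9
            n++
        }
        END {
            if (n) printf "n=%d mean=%.1f\n", n, sum / n
            else print "n=0"
        }
    ' "$@"
}

# Hypothetical records: two proper pairs (200 bp, 300 bp) and one
# read whose pair is not flagged as proper (flag 65), which is ignored.
printf 'r1\t99\tchr1\t100\t60\t50M\t=\t250\t200\t*\t*\n'    >  /tmp/frag_demo.sam
printf 'r1\t147\tchr1\t250\t60\t50M\t=\t100\t-200\t*\t*\n'  >> /tmp/frag_demo.sam
printf 'r2\t99\tchr1\t500\t60\t50M\t=\t750\t300\t*\t*\n'    >> /tmp/frag_demo.sam
printf 'r2\t147\tchr1\t750\t60\t50M\t=\t500\t-300\t*\t*\n'  >> /tmp/frag_demo.sam
printf 'r3\t65\tchr1\t900\t60\t50M\t=\t1900\t1000\t*\t*\n'  >> /tmp/frag_demo.sam
frag_mean /tmp/frag_demo.sam
```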
maq sol2sanger
Function: Convert Solexa FASTQ to standard/Sanger FASTQ format.
Usage: maq sol2sanger in.sol.fastq out.sanger.fastq
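The conversion itself can be sketched in a few lines of awk. This assumes the commonly cited relation Q_phred = 10 * log10(10^(Q_solexa / 10) + 1), with Solexa qualities stored at ASCII offset 64 and Sanger qualities at offset 33. The function name `sol2sanger_sketch` and the rounding constant are assumptions for illustration, not taken from Maq's documentation.

```shell
#!/bin/sh
# Rewrite the quality line (every 4th line) of a Solexa FASTQ as Sanger qualities.
sol2sanger_sketch() {
    awk '
        BEGIN {
            # Build a character-to-code table for printable ASCII.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        }
        NR % 4 == 0 {
            out = ""
            for (i = 1; i <= length($0); i++) {
                sol = ord[substr($0, i, 1)] - 64                  # Solexa score
                phred = int(10 * log(1 + 10 ^ (sol / 10)) / log(10) + 0.499)
                out = out sprintf("%c", 33 + phred)               # Sanger encoding
            }
            print out
            next
        }
        { print }                                                 # header, bases, "+" line
    ' "$@"
}

# Hypothetical single-record input.
printf '@read1\nACGT\n+\nhT@J\n' | sol2sanger_sketch
```

The two scales nearly coincide at high quality and diverge at the low end, where a Solexa score of 0 corresponds to a Phred score of about 3.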
Function: This tool is designed to translate results of the Kraken metagenomic classifier (see citations below) to the full representation of NCBI taxonomy.
Usage: kraken-report --db $DBNAME kraken.output
Function: PRINSEQ is a tool that generates summary statistics of sequence and quality data and that is used to filter, reformat and trim next-generation sequence data. It is particularly designed for 454/Roche data, but can also be used for other types of sequence data. PRINSEQ is available through a user-friendly web interface or as a standalone version. The standalone version is primarily designed for data preprocessing and does not generate summary statistics in graphical form.
Usage: [-fasta|-fastq] input_reads_pair_1.[fasta|fastq] [-fasta2|-fastq2] input_reads_pair_2.[fasta|fastq] -out_format [1|2|3|4|5] [options]
Function: GC content distribution of reads.
Usage: read_GC.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output
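As a rough illustration of what a GC content distribution is, the sketch below computes the GC percentage of each read from plain sequences, one per line, and then tallies how often each value occurs. It works on text rather than BAM and is not the tool's implementation; `read_gc` and the sample reads are hypothetical.

```shell
#!/bin/sh
# Print the GC percentage of each non-empty sequence line.
read_gc() {
    awk '
        length($0) > 0 {
            s = $0
            gc = gsub(/[GCgc]/, "", s)    # gsub returns how many G/C bases it replaced
            printf "%.2f\n", 100 * gc / length($0)
        }
    ' "$@"
}

# Hypothetical reads; the tally shows how many reads fall at each GC value.
printf 'GGCC\nATAT\nACGT\nTGCA\n' | read_gc | sort -n | uniq -c
```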
Function: Calculate the RNA-seq reads coverage over gene body.
Usage: geneBody_coverage.py -r hg19.housekeeping.bed -i /data/alignment/ -o output
Function: Calculate the distributions of clipped nucleotides across reads
Usage: clipping_profile.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -s "PE" -o out
Function: Kraken is a taxonomic sequence classifier that assigns taxonomic labels to short DNA reads.
Usage: kraken --db $DBNAME seqs.fa
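Kraken's standard output is tab-delimited, with a first column of "C" (classified) or "U" (unclassified) followed by the sequence ID, taxonomy ID, sequence length, and LCA mapping. A quick way to summarize a run is to count the two codes, as in this sketch; `kraken_summary` and the sample rows are hypothetical.

```shell
#!/bin/sh
# Count classified and unclassified sequences in Kraken output.
kraken_summary() {
    awk -F'\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END { printf "classified=%d unclassified=%d\n", c, u }
    ' "$@"
}

# Hypothetical output rows: status, sequence ID, taxonomy ID, length, LCA mapping.
printf 'C\tseq1\t562\t100\t562:70\nU\tseq2\t0\t100\t0:70\nC\tseq3\t1280\t100\t1280:70\n' |
    kraken_summary
```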
Function: kraken-translate converts Kraken output into readable taxonomy labels. The file sequences.labels generated by the usage example below is a text file with two tab-delimited columns and one line for each classified sequence in sequences.fa; unclassified sequences are not reported by kraken-translate.
Usage: kraken-translate --db $DBNAME sequences.kraken > sequences.labels
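Since the labels file has two tab-delimited columns (sequence ID and taxonomy label), it is easy to tally how many sequences received each label. The sketch below does this with awk; `label_counts` and the sample labels are hypothetical, and real label strings will be longer.

```shell
#!/bin/sh
# Count sequences per taxonomy label in a two-column, tab-delimited labels file.
label_counts() {
    awk -F'\t' '
        { n[$2]++ }
        END { for (t in n) printf "%s\t%d\n", t, n[t] }
    ' "$@" | LC_ALL=C sort    # fixed locale so the order is reproducible
}

# Hypothetical labels for three classified sequences.
printf 'seq1\troot;Bacteria\nseq2\troot;Bacteria\nseq3\troot;Viruses\n' | label_counts
```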