SAM/BAM Manipulation

java -jar picard.jar
Function: Collect whole genome sequencing-related metrics. This tool computes metrics that are useful for evaluating the coverage and performance of whole genome sequencing experiments. These metrics include the percentages of reads that pass minimal base- and mapping-quality filters, as well as coverage (read-depth) levels. The histogram output is optional and, for a given run, displays two separate outputs on the y-axis while using a single set of values for the x-axis. Specifically, the first column in the histogram table (x-axis) is labeled 'coverage' and represents the possible coverage depths. However, it also represents the range of values for the base quality scores, so a more accurate label would be 'sequence depth and base quality scores'. The second and third columns (y-axes) give the number of bases at a specific sequence depth ('count') and the number of bases at a particular base quality score ('baseq_count'), respectively. Although similar to the CollectWgsMetrics tool, CollectRawWgsMetrics uses less stringent default thresholds. For example, CollectRawWgsMetrics has base and mapping quality score thresholds set to '3' and '0' respectively, while CollectWgsMetrics has the default threshold values set to '20' (at time of writing). Both tools nevertheless allow the user to specify other threshold values.
Usage: java -jar picard.jar CollectRawWgsMetrics I=input.bam O=raw_wgs_metrics.txt R=reference_sequence.fasta INCLUDE_BQ_HISTOGRAM=true
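Sketch: the histogram table described above can be read back with a few lines of Python. This is a minimal sketch, assuming the metrics file is tab-separated and that the histogram section starts at a line beginning with "## HISTOGRAM", followed by a header row and ending at the next blank line; the function name read_histogram is hypothetical, and the column names are taken from whatever header the file contains.

```python
def read_histogram(lines):
    """Return the histogram rows as a list of dicts keyed by column name.

    Assumes (not verified against every Picard version) a tab-separated
    section that begins after a line starting with '## HISTOGRAM'.
    """
    rows = []
    header = None
    in_histogram = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("## HISTOGRAM"):
            in_histogram = True
            continue
        if not in_histogram:
            continue
        if not line.strip():
            if header is not None:
                break  # a blank line after the table ends the section
            continue
        fields = line.split("\t")
        if header is None:
            header = fields  # e.g. coverage, count, baseq_count
        else:
            rows.append({name: float(value) for name, value in zip(header, fields)})
    return rows


# Tiny in-memory example; the numbers are made up for illustration.
example = [
    "## HISTOGRAM\tjava.lang.Integer\n",
    "coverage\tcount\tbaseq_count\n",
    "0\t10\t0\n",
    "1\t5\t3\n",
    "\n",
]
for row in read_histogram(example):
    print(row)
```

With a real file, pass an open file handle, e.g. read_histogram(open("raw_wgs_metrics.txt")).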
java -jar picard.jar
Function: Subsets intervals from a reference sequence to a new FASTA file. This tool takes a list of intervals, reads the corresponding subsequences from a reference FASTA file, and writes them to a new FASTA file as separate records. Note that the reference FASTA file must be accompanied by an index file and the interval list must be provided in Picard list format. The names provided for the intervals are used to name the corresponding records in the output file.
Usage: java -jar picard.jar ExtractSequences INTERVAL_LIST=regions_of_interest.interval_list R=reference.fasta O=extracted_IL_sequences.fasta
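Sketch: the core of the operation is slicing named intervals out of a reference and writing one FASTA record per interval. The Python below is only an illustration, assuming 1-based inclusive interval coordinates and an in-memory reference; extract_sequences and to_fasta are hypothetical helper names, and strand handling is deliberately left out.

```python
def extract_sequences(reference, intervals):
    """reference: dict of contig name -> sequence string.
    intervals: iterable of (contig, start, end, name), 1-based inclusive.
    Returns a list of (name, subsequence) pairs.
    """
    records = []
    for contig, start, end, name in intervals:
        if not 1 <= start <= end <= len(reference[contig]):
            raise ValueError(f"interval {name} is outside {contig}")
        # Convert 1-based inclusive coordinates to a Python slice.
        records.append((name, reference[contig][start - 1:end]))
    return records


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapping at `width`."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


reference = {"chr1": "ACGTACGTAC"}
intervals = [("chr1", 2, 5, "region_a"), ("chr1", 10, 10, "region_b")]
print(to_fasta(extract_sequences(reference, intervals)), end="")
```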
bam2fq.py
Function: Convert alignments in BAM or SAM format into FASTQ format.
Usage: bam2fq.py -i test_PairedEnd_StrandSpecific_hg19.sam -o bam2fq_out1
junction_saturation.py
Function: It is important to check whether the current sequencing depth is sufficient for alternative splicing analyses. For a well-annotated organism, the number of expressed genes in a particular tissue is almost fixed, so the number of splice junctions is also fixed. These splice junctions can be predetermined from the reference gene model. All annotated splice junctions should be rediscovered in a saturated RNA-seq dataset; otherwise, downstream alternative splicing analysis is problematic because low-abundance splice junctions are missing. This module checks for saturation by resampling 5%, 10%, 15%, ..., 95% of the total alignments from a BAM or SAM file, detecting splice junctions in each subset, and comparing them to the reference gene model.
Usage: junction_saturation.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -r hg19.refseq.bed12 -o output
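Sketch: the resampling idea can be illustrated in Python as follows. This is not the module's own code; it assumes each alignment has already been reduced to the list of junctions it spans, uses nested random subsets (prefixes of one shuffle), and the name junction_saturation and the tuple layout (chrom, donor, acceptor) are hypothetical.

```python
import random


def junction_saturation(read_junctions, known_junctions,
                        percents=range(5, 100, 5), seed=0):
    """For each sampling percentage, count known and novel junctions seen.

    read_junctions: list with one entry per alignment, each entry being a
    list of hashable junctions such as (chrom, donor, acceptor).
    Returns a list of (percent, n_known, n_novel) tuples.
    """
    known = set(known_junctions)
    shuffled = list(read_junctions)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    curve = []
    for pct in percents:
        n_reads = len(shuffled) * pct // 100
        seen = set()
        for junctions in shuffled[:n_reads]:
            seen.update(junctions)
        curve.append((pct, len(seen & known), len(seen - known)))
    return curve


# Made-up toy data: two annotated junctions and one unannotated junction.
known = {("chr1", 100, 200), ("chr1", 300, 400)}
reads = ([[("chr1", 100, 200)]] * 10 + [[("chr1", 300, 400)]] * 5
         + [[("chr1", 500, 600)]] * 5 + [[]] * 20)
for pct, n_known, n_novel in junction_saturation(reads, known):
    print(pct, n_known, n_novel)
```

If the known-junction count is still rising at 95%, the library is probably not saturated for splicing analysis.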
java -jar picard.jar
Function: Lifts over an interval list from one reference build to another. This tool adjusts the coordinates in an interval list derived from one reference to match a new reference, based on a chain file that describes the correspondence between the two references. It is based on the UCSC liftOver tool (see: http://genome.ucsc.edu/cgi-bin/hgLiftOver) and uses a UCSC chain file to guide its operation. It accepts either Picard interval_list files or VCF files as interval inputs.
Usage: java -jar picard.jar LiftOverIntervalList I=input.interval_list O=output.interval_list SD=reference_sequence.dict CHAIN=build.chain
samtools flagstat
Function: Uses the samtools flagstat command to print descriptive information for a BAM dataset.
Usage: samtools flagstat in.sam|in.bam|in.cram
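Sketch: the counts in this kind of report derive from the bitwise FLAG field of each alignment record. The Python below tallies a few categories from a list of integer flags using the bit values defined in the SAM specification. It is a simplified illustration, not a reimplementation of the command: the real output has more categories and separates QC-passed from QC-failed reads, and flag_summary is a hypothetical name.

```python
# FLAG bits as defined in the SAM specification.
PAIRED, PROPER_PAIR, UNMAPPED = 0x1, 0x2, 0x4
READ1, READ2 = 0x40, 0x80
SECONDARY, DUPLICATE, SUPPLEMENTARY = 0x100, 0x400, 0x800


def flag_summary(flags):
    """Tally a few flagstat-like categories from integer FLAG values."""
    counts = dict.fromkeys(
        ["total", "secondary", "supplementary", "duplicates", "mapped",
         "paired", "read1", "read2", "properly_paired"], 0)
    for flag in flags:
        counts["total"] += 1
        if flag & SECONDARY:
            counts["secondary"] += 1
        elif flag & SUPPLEMENTARY:
            counts["supplementary"] += 1
        elif flag & PAIRED:
            # Pairing categories are counted for primary records only.
            counts["paired"] += 1
            counts["read1"] += bool(flag & READ1)
            counts["read2"] += bool(flag & READ2)
            if flag & PROPER_PAIR and not flag & UNMAPPED:
                counts["properly_paired"] += 1
        if not flag & UNMAPPED:
            counts["mapped"] += 1
        if flag & DUPLICATE:
            counts["duplicates"] += 1
    return counts


for name, value in flag_summary([99, 147, 4, 256, 2048, 1024]).items():
    print(f"{value}\t{name}")
```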
java -jar picard.jar
Function: Collect metrics to assess oxidative artifacts. This tool collects metrics quantifying the error rate resulting from oxidative artifacts. For a brief primer on oxidative artifacts, see the GATK Dictionary. The tool calculates the Phred-scaled probability that an alternate base call results from an oxidation artifact. This probability score is based on base context, sequencing read orientation, and the characteristic low allelic frequency. Please see the following reference for an in-depth discussion of the OxoG error rate.
Usage: java -jar picard.jar CollectOxoGMetrics I=input.bam O=oxoG_metrics.txt R=reference_sequence.fasta
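Sketch: "Phred-scaled" means a probability p is reported as Q = -10 * log10(p), so smaller error probabilities give larger scores. The Python below shows only this conversion; it does not reproduce how the tool estimates the error rate from base context and read orientation, and the function names are hypothetical.

```python
import math


def phred_scale(probability):
    """Convert a probability in (0, 1] to a Phred-scaled score."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(probability)


def error_probability(score):
    """Convert a Phred-scaled score back to a probability."""
    return 10 ** (-score / 10)


for p in (0.1, 0.01, 0.001):
    print(p, round(phred_scale(p), 2))
```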
java -jar picard.jar
Function: Chart the distribution of quality scores.
Usage: java -jar picard.jar QualityScoreDistribution I=input.bam O=qual_score_dist.txt CHART=qual_score_dist.pdf
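Sketch: a quality score distribution is a count of how many bases carry each quality value. The Python below builds such a histogram from quality strings, assuming the Phred+33 ASCII encoding used in the SAM QUAL field; quality_histogram is a hypothetical name, and the tool's own output format and filtering are not reproduced.

```python
from collections import Counter


def quality_histogram(quality_strings, offset=33):
    """Count bases per quality score from Phred+33 encoded strings."""
    counts = Counter()
    for qualities in quality_strings:
        counts.update(ord(char) - offset for char in qualities)
    return dict(sorted(counts.items()))


# 'I' encodes quality 40, '5' encodes 20 and '!' encodes 0.
for score, n_bases in quality_histogram(["II5", "5!"]).items():
    print(score, n_bases)
```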
java -jar picard.jar
Function: Takes a SAM or BAM file and separates all the reads into one SAM or BAM file per library name. Reads that do not have a read group specified or whose read group does not have a library name are written to a file called 'unknown.' The format (SAM or BAM) of the output files matches that of the input file.
Usage: java -jar picard.jar SplitSamByLibrary
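Sketch: the grouping rule described above (library name looked up through the read group, 'unknown' otherwise) can be illustrated on SAM text lines. This Python is a minimal sketch, assuming header lines precede the reads and using the standard SAM conventions of @RG header lines with ID and LB tags and an RG:Z: tag on each read; split_by_library is a hypothetical name, and only read names are collected rather than writing output files.

```python
from collections import defaultdict


def split_by_library(sam_lines):
    """Group read names by library name; reads without one go to 'unknown'."""
    read_group_to_library = {}
    groups = defaultdict(list)
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("@"):
            if fields[0] == "@RG":
                tags = dict(field.split(":", 1) for field in fields[1:])
                if "ID" in tags and "LB" in tags:
                    read_group_to_library[tags["ID"]] = tags["LB"]
            continue
        # Optional tags start after the 11 mandatory SAM columns.
        read_group = next(
            (f[5:] for f in fields[11:] if f.startswith("RG:Z:")), None)
        library = read_group_to_library.get(read_group, "unknown")
        groups[library].append(fields[0])
    return dict(groups)


sam = [
    "@HD\tVN:1.6\n",
    "@RG\tID:rg1\tLB:libA\n",
    "@RG\tID:rg2\n",
    "r1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:rg1\n",
    "r2\t0\tchr1\t5\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:rg2\n",
    "r3\t0\tchr1\t9\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
]
print(split_by_library(sam))
```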
java -jar picard.jar
Function: Asserts that the last block of the provided gzip file (e.g., BAM) is well-formed; exits with return code 100 otherwise.
Usage: java -jar picard.jar CheckTerminatorBlock
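Sketch: for BGZF-compressed files such as BAM, a well-formed end of file is the fixed 28-byte empty block given in the SAM specification. The Python below checks only for that marker at the end of a stream. It is a simplified illustration, not the tool's own logic (which may distinguish further cases), and has_bgzf_terminator is a hypothetical name.

```python
import io
import os

# The 28-byte empty BGZF block that marks end-of-file (SAM specification).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "42430200"
    "1b00" "0300" "00000000" "00000000"
)


def has_bgzf_terminator(stream):
    """Return True if a seekable binary stream ends with the EOF block."""
    stream.seek(0, os.SEEK_END)
    size = stream.tell()
    if size < len(BGZF_EOF):
        return False
    stream.seek(size - len(BGZF_EOF))
    return stream.read(len(BGZF_EOF)) == BGZF_EOF


# In-memory stand-ins for a complete and a truncated file.
print(has_bgzf_terminator(io.BytesIO(b"payload" + BGZF_EOF)))
print(has_bgzf_terminator(io.BytesIO(b"payload" + BGZF_EOF[:-1])))
```

For a file on disk, open it in binary mode first, e.g. has_bgzf_terminator(open("input.bam", "rb")).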
java -jar picard.jar
Function: Merges multiple VCF or BCF files into one VCF file. Input files must be sorted by their contigs and, within contigs, by start position. The input files must have the same sample and contig lists. An index file is created and a sequence dictionary is required by default.
Usage: java -jar picard.jar MergeVcfs
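Sketch: the sorting requirement above is what makes a streaming merge possible, and the contig list supplies the contig order. The Python below illustrates the idea with heapq.merge on simplified (contig, position) records; it is not the tool's implementation, and merge_sorted_records and the record layout are hypothetical.

```python
import heapq


def merge_sorted_records(inputs, contig_order):
    """Merge already-sorted record lists by (contig rank, position).

    inputs: list of lists of (contig, position) tuples, each list sorted
    by contig order and then by position.
    contig_order: contig names in reference (sequence dictionary) order.
    """
    rank = {contig: index for index, contig in enumerate(contig_order)}
    return list(heapq.merge(*inputs, key=lambda rec: (rank[rec[0]], rec[1])))


file_a = [("chr1", 100), ("chr2", 5)]
file_b = [("chr1", 50), ("chr1", 150), ("chr2", 1)]
for record in merge_sorted_records([file_a, file_b], ["chr1", "chr2"]):
    print(record)
```

If an input is not sorted in this order, a streaming merge of this kind silently produces unsorted output, which is why the inputs must be pre-sorted.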
read_distribution.py
Function: Given a BAM/SAM file and a reference gene model, this module calculates how mapped reads are distributed over genome features (such as CDS exons, 5’ UTR exons, 3’ UTR exons, introns, and intergenic regions). When genome features overlap (e.g. a region could be annotated as both exon and intron by two different transcripts), they are prioritized as: CDS exons > UTR exons > Introns > Intergenic regions. For example, if a read maps to both a CDS exon and an intron, it is assigned to CDS exons.
Usage: read_distribution.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -r hg19.refseq.bed12
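Sketch: the priority rule can be expressed as "take the first matching feature class in priority order". The Python below is an illustration only, assuming each read is represented by a single position (for example its midpoint, which is an assumption here) and each feature class by half-open (start, end) intervals on one chromosome; assign_feature and tally are hypothetical names.

```python
from collections import Counter

# Highest priority first; anything unmatched is intergenic.
PRIORITY = ["CDS_Exons", "UTR_Exons", "Introns"]


def assign_feature(position, features):
    """Return the highest-priority feature class containing `position`.

    features: dict of class name -> list of half-open (start, end) tuples.
    """
    for name in PRIORITY:
        if any(start <= position < end for start, end in features.get(name, [])):
            return name
    return "Intergenic"


def tally(positions, features):
    """Count read positions per feature class."""
    return Counter(assign_feature(position, features) for position in positions)


features = {
    "CDS_Exons": [(100, 200)],
    "UTR_Exons": [(50, 100)],
    "Introns": [(150, 400)],  # overlaps the CDS exon from another transcript
}
print(tally([60, 160, 300, 1000], features))
```

Position 160 lies in both the CDS exon and the intron and is counted as CDS_Exons, matching the rule above.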
java -jar picard.jar
Function: Collect metrics regarding GC bias. This tool collects information about the relative proportions of guanine (G) and cytosine (C) nucleotides in a sample. Regions of high and low G + C content have been shown to interfere with mapping/aligning, ultimately leading to fragmented genome assemblies and poor coverage in a phenomenon known as 'GC bias'. Detailed information on the effects of GC bias on the collection and analysis of sequencing data can be found at DOI: 10.1371/journal.pone.0062856.
Usage: java -jar picard.jar CollectGcBiasMetrics I=input.bam O=gc_bias_metrics.txt CHART=gc_bias_metrics.pdf S=summary_metrics.txt R=reference_sequence.fasta
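Sketch: the quantity underlying these metrics is the G + C fraction of a stretch of sequence. The Python below computes it for a sequence and for sliding windows. It illustrates the measure only; the window size is arbitrary, and the tool's normalization of coverage per GC bin is not reproduced. The function names are hypothetical.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the A/C/G/T bases; None if there are none."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    at = sequence.count("A") + sequence.count("T")
    if gc + at == 0:
        return None  # e.g. a window made entirely of N
    return gc / (gc + at)


def windowed_gc(sequence, window):
    """GC fraction of every full-length sliding window, stepping by one base."""
    return [gc_fraction(sequence[i:i + window])
            for i in range(len(sequence) - window + 1)]


print(gc_fraction("ATGC"))
print(windowed_gc("AAGG", 2))
```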
bamtools
Function: The command bamtools resolve resolves paired-end reads. The resolving mode is required, and it can be -makeStats, -markPairs, or -twoPass.
Usage: bamtools resolve -twoPass -in input_alignments.bam -out output_alignments.bam
java -jar picard.jar
Function: Transforms raw Illumina sequencing data into an unmapped SAM or BAM file.
Usage: java -jar picard.jar IlluminaBasecallsToSam BASECALLS_DIR=/BaseCalls/ LANE=001 READ_STRUCTURE=25T8B25T RUN_BARCODE=run15 IGNORE_UNEXPECTED_BARCODES=true LIBRARY_PARAMS=library.params
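Sketch: a READ_STRUCTURE value such as 25T8B25T is a run of (cycle count, segment type) pairs. The Python below splits such a string into its segments, assuming the usual Picard convention that T marks template bases and B marks barcode bases; parse_read_structure is a hypothetical helper, not part of Picard, and it accepts any uppercase letter as a segment type.

```python
import re


def parse_read_structure(text):
    """Split a read structure string into (cycles, type) segments."""
    segments = [(int(count), kind)
                for count, kind in re.findall(r"(\d+)([A-Z])", text)]
    # Reject strings that contain anything other than count+letter pairs.
    if not segments or "".join(f"{n}{k}" for n, k in segments) != text:
        raise ValueError(f"malformed read structure: {text!r}")
    return segments


segments = parse_read_structure("25T8B25T")
print(segments)
print("total cycles:", sum(count for count, _ in segments))
```

For the usage line above this gives three segments: 25 template cycles, 8 barcode cycles, and 25 template cycles.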