java -jar picard.jar

Reference

Identifies duplicate reads, accounting for mate CIGAR. This tool locates and tags duplicate reads (both PCR and optical) in a BAM or SAM file, where duplicate reads are defined as originating from the same original fragment of DNA, taking into account the CIGAR string of read mates. It is intended as an improvement upon the original MarkDuplicates algorithm, from which it differs in several ways, including how it breaks ties. It may be the most effective duplicate marking program available, as it handles all cases, including clipped and gapped alignments, and locates duplicate molecules using mate CIGAR information. However, please note that it is not yet used in the Broad's production pipeline, so use it at your own risk. Note also that this tool will not work with alignments that have large gaps or deletions, such as those from RNA-seq data. The tool must buffer small genomic windows to ensure the integrity of the duplicate marking, and large skips in the alignment records (e.g. skipped introns) would force that window to become very large, exhausting memory.

Usage

java -jar picard.jar MarkDuplicatesWithMateCigar I=input.bam O=mark_dups_w_mate_cig.bam M=mark_dups_w_mate_cig_metrics.txt

Manual

MINIMUM_DISTANCE (Integer)    The minimum distance to buffer records to account for clipping on the 5' end of the records. For a given alignment, this parameter controls the width of the window to search for duplicates of that alignment. Due to 5' read clipping, duplicates do not necessarily have the same 5' alignment coordinates, so the algorithm needs to search the surrounding neighborhood. For single-end sequencing data, the neighborhood is determined only by the amount of clipping (assuming no split reads), so setting MINIMUM_DISTANCE to twice the sequencing read length should be sufficient. For paired-end sequencing, the neighborhood is also determined by the fragment insert size, so you may want to set MINIMUM_DISTANCE to something like twice the 99.5th percentile of the fragment insert size distribution (see CollectInsertSizeMetrics). Alternatively, set this number to -1 to use either a) twice the first read's read length, or b) 100, whichever is smaller. Note that the larger the window, the greater the RAM requirements, so you could run into performance limitations if you use a value that is unnecessarily large. Default value: -1. This option can be set to 'null' to clear the default value.
SKIP_PAIRS_WITH_NO_MATE_CIGAR (Boolean)    Skip record pairs with no mate cigar and include them in the output. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
BLOCK_SIZE (Integer)    The block size for use in the coordinate-sorted record buffer. Default value: 100000. This option can be set to 'null' to clear the default value.
INPUT (String)    One or more input SAM or BAM files to analyze. Must be coordinate sorted. Default value: null. This option may be specified 0 or more times.
OUTPUT (File)    The output file to write marked records to. Required.
METRICS_FILE (File)    File to write duplication metrics to. Required.
REMOVE_DUPLICATES (Boolean)    If true, do not write duplicates to the output file instead of writing them with appropriate flags set. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ASSUME_SORTED (Boolean)    If true, assume that the input file is coordinate sorted even if the header says otherwise. Deprecated, use ASSUME_SORT_ORDER=coordinate instead. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false} Cannot be used in conjunction with option(s) ASSUME_SORT_ORDER (ASO)
ASSUME_SORT_ORDER (SortOrder)    If not null, assume that the input file has this order even if the header says otherwise. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate} Cannot be used in conjunction with option(s) ASSUME_SORTED (AS)
DUPLICATE_SCORING_STRATEGY (ScoringStrategy)    The scoring strategy for choosing the non-duplicate among candidates. Default value: TOTAL_MAPPED_REFERENCE_LENGTH. This option can be set to 'null' to clear the default value. Possible values: {SUM_OF_BASE_QUALITIES, TOTAL_MAPPED_REFERENCE_LENGTH, RANDOM}
PROGRAM_RECORD_ID (String)    The program record ID for the @PG record(s) created by this program. Set to null to disable PG record creation. This string may have a suffix appended to avoid collision with other program record IDs. Default value: MarkDuplicates. This option can be set to 'null' to clear the default value.
PROGRAM_GROUP_VERSION (String)    Value of VN tag of PG record to be created. If not specified, the version will be detected automatically. Default value: null.
PROGRAM_GROUP_COMMAND_LINE (String)    Value of CL tag of PG record to be created. If not supplied the command line will be detected automatically. Default value: null.
PROGRAM_GROUP_NAME (String)    Value of PN tag of PG record to be created. Default value: MarkDuplicatesWithMateCigar. This option can be set to 'null' to clear the default value.
COMMENT (String)    Comment(s) to include in the output file's header. Default value: null. This option may be specified 0 or more times.
READ_NAME_REGEX (String)    Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done; instead, the read name is split on the colon character. For 5-element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7-element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values. Default value: . This option can be set to 'null' to clear the default value.
OPTICAL_DUPLICATE_PIXEL_DISTANCE (Integer)    The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best. Default value: 100. This option can be set to 'null' to clear the default value.
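
The default read-name handling described under READ_NAME_REGEX and the offset test described under OPTICAL_DUPLICATE_PIXEL_DISTANCE can be illustrated with a rough Python sketch. This is not Picard's code. The function names are hypothetical, and two behaviors are assumptions: that a name with a non-numeric field is treated as unparseable, and that two clusters must sit on the same tile with both their x and y offsets within the limit.

```python
from typing import Optional, Tuple

Location = Tuple[int, int, int]  # (tile, x, y)


def parse_read_location(read_name: str) -> Optional[Location]:
    """Split a read name on ':' and pick out tile, x and y.

    5-element names: 3rd, 4th and 5th elements.
    7-element names (CASAVA 1.8): 5th, 6th and 7th elements.
    Returns None for any other shape (assumption: such names are skipped).
    """
    fields = read_name.split(":")
    if len(fields) == 5:
        tile, x, y = fields[2:5]
    elif len(fields) == 7:
        tile, x, y = fields[4:7]
    else:
        return None
    try:
        return int(tile), int(x), int(y)
    except ValueError:
        # Assumption: non-numeric fields make the name unparseable.
        return None


def within_pixel_distance(a: Location, b: Location, max_distance: int = 100) -> bool:
    """Assumed optical-duplicate test: same tile, and both the x offset
    and the y offset are no larger than max_distance."""
    same_tile = a[0] == b[0]
    return (
        same_tile
        and abs(a[1] - b[1]) <= max_distance
        and abs(a[2] - b[2]) <= max_distance
    )


if __name__ == "__main__":
    first = parse_read_location("M1:42:FC1:1:1101:1500:2000")
    second = parse_read_location("M1:42:FC1:1:1101:1550:2090")
    if first is not None and second is not None:
        print(within_pixel_distance(first, second))
```

Passing `max_distance=2500` mirrors the value the option description suggests for patterned flowcells.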

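The MINIMUM_DISTANCE guidance can be turned into a small calculation. The Python sketch below is only illustrative: the function names are hypothetical, the insert sizes are assumed to be available as a plain list (in practice they would come from something like CollectInsertSizeMetrics output), and the nearest-rank percentile method is a choice made here, not something the tool documents.

```python
import math
from typing import Iterable


def default_minimum_distance(first_read_length: int) -> int:
    """Value described for MINIMUM_DISTANCE=-1: twice the first read's
    length or 100, whichever is smaller."""
    return min(2 * first_read_length, 100)


def paired_end_minimum_distance(insert_sizes: Iterable[int],
                                percentile: float = 99.5) -> int:
    """Twice the given percentile of the insert size distribution.

    Uses the nearest-rank percentile definition (an assumption).
    """
    ordered = sorted(insert_sizes)
    if not ordered:
        raise ValueError("insert_sizes must not be empty")
    # Nearest rank: smallest 1-based rank covering `percentile` percent.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return 2 * ordered[rank - 1]


if __name__ == "__main__":
    print(default_minimum_distance(36))   # short single-end reads
    print(default_minimum_distance(150))  # capped at 100
    print(paired_end_minimum_distance(range(1, 1001)))
```

The resulting number would be passed on the command line as MINIMUM_DISTANCE; keep in mind the note above that larger windows increase RAM requirements.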