Category

Reads Manipulation


Usage

umi_tools dedup [OPTIONS] [--stdin=IN_BAM] [--stdout=OUT_BAM] > OUTFILE


Manual

Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read. The identification of duplicate reads is performed in an error-aware manner by building networks of related UMIs (see --method). dedup can also handle cell barcoded input (see --per-cell).

For every group of duplicate reads, a single representative read is retained. The following criteria are applied to select the read that will be retained from a group of duplicated reads:

  1. The read with the lowest number of mapping coordinates (see --multimapping-detection-method option)
  2. The read with the highest mapping quality. Note that this is not the read sequencing quality and that if two reads have the same mapping quality then one will be picked at random regardless of the read quality.

Otherwise a read is chosen at random.
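The selection criteria above can be sketched in a few lines of Python. This is only an illustration of the described ordering, not the umi_tools implementation: the Read class, its field names and pick_representative are hypothetical, and the number of mapping coordinates is assumed to be already available (for example from an NH tag).

```python
import random
from dataclasses import dataclass


@dataclass
class Read:
    # Hypothetical minimal read record, for illustration only.
    name: str
    n_mapping_coords: int  # e.g. taken from a multimapping tag such as NH
    mapq: int              # mapping quality, not base quality


def pick_representative(reads, rng=None):
    """Pick one read from a group of duplicates.

    1. Keep the reads with the fewest mapping coordinates.
    2. Of those, keep the reads with the highest mapping quality.
    3. Break any remaining tie at random.
    """
    rng = rng or random.Random()
    fewest = min(r.n_mapping_coords for r in reads)
    candidates = [r for r in reads if r.n_mapping_coords == fewest]
    best_mapq = max(r.mapq for r in candidates)
    candidates = [r for r in candidates if r.mapq == best_mapq]
    return rng.choice(candidates)


group = [Read("r1", 2, 60), Read("r2", 1, 30), Read("r3", 1, 40)]
print(pick_representative(group, random.Random(0)).name)  # r3
```

Note that r1 has the highest mapping quality but is not chosen, because the number of mapping coordinates is considered first.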

Options

Dedup-specific options
  • --output-stats=STATS: Specify location to output edit distance statistics and UMI usage statistics. Output files are: 
    • [PREFIX]_stats_per_umi_per_position.tsv: Histogram of counts per position per UMI pre- and post-deduplication
    • [PREFIX]_stats_per_umi_per.tsv: Table of stats per umi. Number of times UMI was observed, total counts and median counts, pre- and post-deduplication
    • [PREFIX]_stats_edit_distance.tsv: Edit distance between UMIs at each position. Positions with a single UMI are reported separately. Reported pre- and post-deduplication, including null expectations from random sampling of UMIs from the UMIs observed across all positions.
Barcode extraction options

It is assumed that the FASTQ files were processed with umi_tools extract before mapping and thus the UMI is the last word of the read name, e.g.:

@HISEQ:87:00000000_AATT

where AATT is the UMI sequence. If you have used an alternative method which does not separate the read id and UMI with a "_", such as bcl2fastq which uses ":", you can specify the separator with the option --umi-separator=UMI_SEP, replacing UMI_SEP with e.g. ":".
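As a minimal sketch of the read-name convention, the following Python function takes everything after the last separator as the UMI. The function name umi_from_read_name is hypothetical and this is not the code umi_tools runs; it only illustrates the effect of changing the separator.

```python
def umi_from_read_name(read_name, separator="_"):
    """Return the UMI, assumed to be the text after the last separator."""
    read_id, sep, umi = read_name.rpartition(separator)
    if not sep:
        # The separator does not occur in the read name at all.
        raise ValueError(f"separator {separator!r} not found in {read_name!r}")
    return umi


# Default separator, as written by umi_tools extract.
print(umi_from_read_name("HISEQ:87:00000000_AATT"))  # AATT

# A ":"-separated name, as in the bcl2fastq case mentioned above.
print(umi_from_read_name("HISEQ:87:00000000:AATT", separator=":"))  # AATT
```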

Alternatively, if your UMIs are encoded in a tag, you can specify this by setting the option --extract-umi-method=tag and set the tag name with the --umi-tag option. For example, if your UMIs are encoded in the 'UM' tag, provide the following options: --extract-umi-method=tag and --umi-tag=UM

Finally, if you have used umis to extract the UMI +/- cell barcode, you can specify --extract-umi-method=umis

The start position of a read is considered to be the start of its alignment minus any soft clipped bases. A read aligned at position 500 with cigar 2S98M will be assumed to start at position 498.
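The soft-clip adjustment can be illustrated with a short Python sketch. It covers only the case described above, where the CIGAR string begins with a soft clip; the function name is hypothetical, and other cases (for example a leading hard clip or reverse-strand reads) are deliberately not handled.

```python
import re

# A soft clip at the very start of the CIGAR string, e.g. "2S" in "2S98M".
LEADING_SOFT_CLIP = re.compile(r"^(\d+)S")


def start_with_soft_clip(aligned_pos, cigar):
    """Return the alignment start minus any leading soft-clipped bases."""
    match = LEADING_SOFT_CLIP.match(cigar)
    clipped = int(match.group(1)) if match else 0
    return aligned_pos - clipped


print(start_with_soft_clip(500, "2S98M"))  # 498
print(start_with_soft_clip(500, "100M"))   # 500
```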

  • --extract-umi-method=GET_UMI_METHOD: how is the read UMI +/- cell barcode encoded?
    • read_id (default): Barcodes are contained at the end of the read separated as specified with --umi-separator option

    • tag: Barcodes contained in a tag(s), see --umi-tag/--cell-tag options

    • umis: Barcodes were extracted using umis (https://github.com/vals/umis)

  • --umi-separator=UMI_SEP: separator between read id and UMI. Default: _
  • --umi-tag=UMI_TAG: tag containing umi
  • --umi-tag-split=UMI_TAG_SPLIT: split UMI in tag and take the first element
  • --umi-tag-delimiter=UMI_TAG_DELIM: concatenate UMI in tag separated by delimiter
  • --cell-tag=CELL_TAG: tag containing cell barcode
  • --cell-tag-split=CELL_TAG_SPLIT: split cell barcode in tag and take the first element, e.g. for 10X GEM tags
  • --cell-tag-delimiter=CELL_TAG_DELIM: concatenate cell barcode in tag separated by delimiter
UMI grouping options
  • --method=METHOD: method to use for umi grouping. All methods start by identifying the reads with the same mapping position.

    The simplest methods, unique and percentile, group reads with the exact same UMI. The network-based methods, cluster, adjacency and directional, build networks where nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1). The groups of reads are then defined from the network in a method-specific manner. For all the network-based methods, each read group is equivalent to one read count for the gene.

    • unique: Reads in a group share the exact same UMI

    • percentile: Reads in a group share the exact same UMI. UMIs with counts < 1% of the median counts for UMIs at the same position are ignored.

    • cluster: Identify clusters of connected UMIs (based on hamming distance threshold). Each network is a read group

    • adjacency: Cluster UMIs as above. For each cluster, select the node (UMI) with the highest counts. Visit all nodes one edge away. If all nodes have been visited, stop. Otherwise, repeat with remaining nodes until all nodes have been visited. Each step defines a read group.

    • directional (default): Identify clusters of connected UMIs (based on hamming distance threshold), where UMI A is connected to UMI B only if UMI A counts >= (2 * UMI B counts) - 1. Each network is a read group.

  • --edit-distance-threshold=THRESHOLD: Edit distance threshold at which to join two UMIs when grouping UMIs. [default=1]
  • --spliced-is-unique: Treat a spliced read as different to an unspliced one [default=False]
  • --soft-clip-threshold=SOFT_CLIP_THRESHOLD: number of bases clipped from 5' end before read is counted as spliced [default=4]
  • --read-length: use read length in addition to position and UMI to identify possible duplicates [default=False]
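To make the directional rule under --method concrete, here is a rough Python sketch for the UMIs observed at a single position. It is written from the description above only and is not the umi_tools source: the function names are hypothetical, all UMIs are assumed to have the same length, and starting each network from the most abundant unassigned UMI is an assumption of this sketch.

```python
def hamming(a, b):
    """Number of mismatching positions (assumes equal-length UMIs)."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(umi_counts, threshold=1):
    """Group UMIs at one position using the directional rule.

    A directed edge A -> B exists when the UMIs are within the edit
    distance threshold and count(A) >= 2 * count(B) - 1.
    """
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    umis = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    edges = {
        a: [
            b for b in umis
            if b != a
            and hamming(a, b) <= threshold
            and umi_counts[a] >= 2 * umi_counts[b] - 1
        ]
        for a in umis
    }
    seen, groups = set(), []
    for umi in umis:
        if umi in seen:
            continue
        # Collect everything reachable from this UMI along directed edges.
        group, stack = [umi], [umi]
        seen.add(umi)
        while stack:
            for neighbour in edges[stack.pop()]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    group.append(neighbour)
                    stack.append(neighbour)
        groups.append(group)
    return groups


counts = {"ACGT": 100, "ACGA": 10, "TCGT": 60, "GGGG": 3}
print(directional_groups(counts))
# [['ACGT', 'ACGA'], ['TCGT'], ['GGGG']]
```

In this example TCGT is one mismatch away from ACGT, so the cluster method would place both in one group. Under the directional rule it stays separate, because 100 is less than 2 * 60 - 1 = 119.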
Single-cell RNA-Seq options
  • --per-gene: Reads will be grouped together if they have the same gene. This is useful if your library prep generates PCR duplicates with non identical alignment positions such as CEL-Seq. Note this option is hardcoded to be on with the count command. I.e counting is always performed per-gene. Must combine with either --gene-tag or --per-contig
  • --gene-tag=GENE_TAG: Gene is defined by this bam tag [default=none]
  • --assigned-status-tag=ASSIGNED_TAG: Bam tag describing whether read is assigned to a gene. By default, this is set as the same tag as --gene-tag
  • --skip-tags-regex=SKIP_REGEX: Used with --gene-tag. Ignore reads where the gene-tag matches this regex. Default: "^[__|Unassigned]"
  • --per-contig: Deduplicate per contig (field 3 in BAM; RNAME). All reads with the same contig will be considered to have the same alignment position. This is useful if you have aligned to a reference transcriptome with one transcript per gene. If you have aligned to a transcriptome with more than one transcript per gene, you can supply a map between transcripts and gene using the --gene-transcript-map option
  • --gene-transcript-map=GENE_TRANSCRIPT_MAP: File mapping transcripts to genes (tab separated), e.g:
    gene1   transcript1
    gene1   transcript2
    gene2   transcript3
  • --per-cell: group/dedup/count per cell
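The file expected by --gene-transcript-map is a plain two-column, tab-separated table with the gene first and the transcript second. As an illustration of that layout only, the Python sketch below reads such a table into a transcript-to-gene lookup; the function name is hypothetical and the in-memory example stands in for a real file.

```python
import csv
import io


def read_gene_transcript_map(handle):
    """Read 'gene<TAB>transcript' lines into a transcript -> gene dict."""
    transcript_to_gene = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        gene, transcript = row[0], row[1]
        transcript_to_gene[transcript] = gene
    return transcript_to_gene


# Same content as the example map shown above.
example = io.StringIO(
    "gene1\ttranscript1\n"
    "gene1\ttranscript2\n"
    "gene2\ttranscript3\n"
)
mapping = read_gene_transcript_map(example)
print(mapping["transcript2"])  # gene1
```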
Group/dedup options
  • --buffer-whole-contig: Read whole contig before outputting bundles: guarantees that no reads are missed, but increases memory usage
  • --multimapping-detection-method=[NH/X0/XT]: Some aligners identify multimapping using bam tags. Setting this option to NH, X0 or XT will use these tags when selecting the best read amongst reads with the same position and umi [default=none]
SAM/BAM options
  • --mapping-quality=MAPPING_QUALITY: Minimum mapping quality for a read to be retained [default=0]
  • --unmapped-reads=UNMAPPED_READS: How to handle unmapped reads. Options are 'discard', 'use' or 'correct' [default=discard]
  • --chimeric-pairs=CHIMERIC_PAIRS: How to handle chimeric read pairs. Options are 'discard', 'use' or 'correct' [default=use]
  • --unpaired-reads=UNPAIRED_READS: How to handle unpaired reads. Options are 'discard', 'use' or 'correct' [default=use]
  • --ignore-umi: Ignore UMI and dedup only on position
  • --chrom=CHROM: Restrict to one chromosome
  • --subset=SUBSET: Use only a fraction of reads, specified by subset
  • -i, --in-sam: Input file is in sam format [default=False]
  • --paired: paired input BAM. [default=False]
  • -o, --out-sam: Output alignments in sam format [default=False]
  • --no-sort-output: Don't sort the output
Input/output options
  • -I FILE, --stdin=FILE: file to read stdin from [default = stdin].
  • -L FILE, --log=FILE: file with logging information [default = stdout].
  • -E FILE, --error=FILE: file with error information [default = stderr].
  • -S FILE, --stdout=FILE: file where output is to go [default = stdout].
  • --temp-dir=FILE: Directory for temporary files. If not set, the bash environment variable TMPDIR is used [default = None].
  • --log2stderr: send logging information to stderr [default = False].
  • --compresslevel=COMPRESSLEVEL: Level of Gzip compression to use. Default (6) matches GNU gzip rather than python gzip default (which is 9)
Profiling options
  • --timeit=TIMEIT_FILE: store timing information in file [none].
  • --timeit-name=TIMEIT_NAME: name in timing file for this class of jobs [all].
  • --timeit-header: add header for timing information [none].
Common options
  • -v LOGLEVEL, --verbose=LOGLEVEL: loglevel [1]. The higher, the more output.
  • -h, --help: output short help (command line options only).
  • --help-extended: Output full documentation
  • --random-seed=RANDOM_SEED: random seed to initialize number generator with [none].
  • --version: show program's version number and exit

Protocols using this tool

PROcap preprocessing (with two replicates)