Category

Sam/Bam Manipulation


Usage

samtools view [options] in.sam|in.bam|in.cram [region...]


Manual

With no options or regions specified, this command prints all alignments in the specified input alignment file (in SAM, BAM, or CRAM format) to standard output in SAM format (with no header).

You may specify one or more space-separated region specifications after the input filename to restrict output to only those alignments which overlap the specified region(s). Use of region specifications requires a coordinate-sorted and indexed input file (in BAM or CRAM format).

The -X option lets you specify the location of customized index file(s) when the data folder does not contain an index file. See the Examples section for sample usage.

Required arguments

  • <in.bam>|<in.sam>|<in.cram> file: Input BAM, SAM, or CRAM file.

Options

Output options

If you want to change the output format from the default of headerless SAM:

  • -b, --bam: Output BAM.
  • -C, --cram: Output CRAM (requires -T).
  • -1, --fast: Use fast BAM compression (implies --bam).
  • -u, --uncompressed: Uncompressed BAM output (implies --bam).
  • -h, --with-header: Include header in SAM output.
  • -H, --header-only: Print SAM header only (no alignments).
  • --no-header: Print SAM alignment records only [default].
  • -c, --count: Print only the count of matching records. All filter options, such as -f, -F, and -q, are taken into account. The -p option is ignored in this mode.
  • -p, --unmap: Set the UNMAP flag on alignments that are not selected by the filter options. These alignments are then written to the normal output. This is not compatible with -U.

If you want to set the output file name(s):

  • -o, --output FILE: Write output to FILE [standard output].
  • -U, --unoutput FILE, --output-unselected FILE: Output reads not selected by filters to FILE.

Input options

  • -t, --fai-reference FILE: A tab-delimited FILE. Each line must contain the reference name in the first column and the length of the reference in the second column, with one line for each distinct reference. Any additional fields beyond the second column are ignored. This file also defines the order of the reference sequences in sorting. If you run: samtools faidx <ref.fa>, the resulting index file <ref.fa>.fai can be used as this FILE. One of -t or -T options is required when SAM input does not contain @SQ headers, and the -T option is required whenever writing CRAM output.
  • -M, --use-index: Use the multi-region iterator on the union of a BED file and command-line region arguments. This avoids re-reading the same regions of files so can sometimes be much faster. Note this also removes duplicate sequences. Without this a sequence that overlaps multiple regions specified on the command line will be reported multiple times. The usage of a BED file is optional and its path has to be preceded by -L option.
  • --region[s]-file FILE: Use index to include only reads overlapping FILE. Equivalent to -M -L FILE or --use-index --target-file FILE.
  • -X, --customized-index: Include customized index file as a part of arguments.
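
The "union" behaviour described for -M can be pictured as merging overlapping intervals before any reads are fetched, so that a read spanning two requested regions is reported once. The following is a minimal sketch of that idea only, not the htslib implementation; the function name `merge_regions` and the `(ref, start, end)` tuple layout are hypothetical.

```python
from collections import defaultdict

def merge_regions(regions):
    """Merge overlapping (ref, start, end) regions (1-based, inclusive).

    Hypothetical helper: references are sorted by name here for simplicity.
    """
    by_ref = defaultdict(list)
    for ref, start, end in regions:
        by_ref[ref].append((start, end))

    merged = []
    for ref in sorted(by_ref):
        current_start, current_end = None, None
        for start, end in sorted(by_ref[ref]):
            if current_end is not None and start <= current_end:
                # Overlaps the interval being built: extend it.
                current_end = max(current_end, end)
            else:
                if current_end is not None:
                    merged.append((ref, current_start, current_end))
                current_start, current_end = start, end
        merged.append((ref, current_start, current_end))
    return merged

regions = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 1, 50)]
print(merge_regions(regions))
# [('chr1', 100, 300), ('chr2', 1, 50)]
```

A read overlapping both chr1:100-200 and chr1:150-300 falls inside the single merged interval, which is why it would be visited only once.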

Filtering options

Filter the alignments that will be included in the output to only those alignments that match certain criteria.

  • -L, --target[s]-file FILE: Only output alignments that overlap the (BED) regions in FILE.
  • -r, --read-group STR: Output alignments in read group STR [null]. Note that records with no RG tag will also be output when using this option. This behaviour may change in a future release.
  • -R, --read-group-file FILE: Output alignments in read groups listed in FILE. Note that records with no RG tag will also be output when using this option. This behaviour may change in a future release.
  • -N, --qname-file FILE: Output only alignments with read names listed in FILE.
  • -d, --tag STR1[:STR2]: Only output alignments with tag STR1 and associated value STR2, which can be a string or an integer [null]. The value can be omitted, in which case only the tag is considered. Note that this option does not specify a tag type. For example, use -d XX:42 to select alignments with an XX:i:42 field, not -d XX:i:42.
  • -D, --tag-file STR:FILE: Only output alignments with tag STR whose value is listed in FILE.
  • -q, --min-MQ INT: Skip alignments with MAPQ smaller than INT.
  • -l, --library STR: Only output alignments in library STR.
  • -m, --min-qlen INT: Only output alignments with number of CIGAR bases consuming query sequence $\ge$ INT.
  • -e, --expr STR: Only include alignments that match the filter expression STR. The syntax for these expressions is described in the Filter expressions section below. Available from version 1.12 onwards.
  • -f, --require-flags FLAG: Only output alignments with all bits set in FLAG present in the FLAG field. FLAG can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with `0' (i.e. /^0[0-7]+/), as a decimal number not beginning with '0' or as a comma-separated list of flag names.
  • -F, --excl[ude]-flags FLAG: Do not output alignments with any bits set in FLAG present in the FLAG field. FLAG can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with `0' (i.e. /^0[0-7]+/), as a decimal number not beginning with '0' or as a comma-separated list of flag names.
  • --rf FLAG, --incl-flags FLAG, --include-flags FLAG: Only output alignments with any bit set in FLAG present in the FLAG field. FLAG can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with `0' (i.e. /^0[0-7]+/), as a decimal number not beginning with '0' or as a comma-separated list of flag names.
  • -G FLAG: Do not output alignments with all bits set in FLAG present in the FLAG field. This is the opposite of -f such that -f12 -G12 is the same as no filtering at all. FLAG can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with `0' (i.e. /^0[0-7]+/), as a decimal number not beginning with '0' or as a comma-separated list of flag names.
  • --subsample FLOAT: Output only a proportion of the input alignments, FLOAT should be in the range $[0.0, 1.0]$, which gives the fraction of templates/pairs to be kept. This subsampling acts in the same way on all of the alignment records in the same template or read pair, so it never keeps a read but not its mate.
  • --subsample-seed INT: Subsampling seed used to influence which subset of reads is kept. When subsampling data that has previously been subsampled, be sure to use a different seed value from those used previously; otherwise more reads will be retained than expected [0].
  • -s INT.FRAC: Same as --subsample 0.FRAC and --subsample-seed INT.
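
The FLAG syntax and the four flag filters above can be modelled in a few lines. This is a sketch under stated assumptions rather than the samtools source: the helper names `parse_flag` and `passes` are hypothetical, the flag names are the ones reported by `samtools flags`, and treating an unset --rf as "no restriction" is an assumption.

```python
# Flag names and bit values (names as listed by "samtools flags").
FLAG_NAMES = {
    "PAIRED": 0x1, "PROPER_PAIR": 0x2, "UNMAP": 0x4, "MUNMAP": 0x8,
    "REVERSE": 0x10, "MREVERSE": 0x20, "READ1": 0x40, "READ2": 0x80,
    "SECONDARY": 0x100, "QCFAIL": 0x200, "DUP": 0x400, "SUPPLEMENTARY": 0x800,
}

def parse_flag(text):
    """Parse FLAG given as hex (0x..), octal (0..), decimal, or flag names."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    if text.isdigit():
        if text.startswith("0") and len(text) > 1:
            return int(text, 8)
        return int(text)
    value = 0
    for name in text.split(","):
        value |= FLAG_NAMES[name.strip().upper()]
    return value

def passes(flag, require=0, exclude=0, include_any=0, exclude_all=0):
    """Apply -f (require), -F (exclude), --rf (include_any), -G (exclude_all)."""
    if (flag & require) != require:        # -f: all listed bits must be set
        return False
    if flag & exclude:                     # -F: no listed bit may be set
        return False
    if include_any and not (flag & include_any):   # --rf: at least one bit set
        return False
    if exclude_all and (flag & exclude_all) == exclude_all:  # -G: reject if all set
        return False
    return True

print(parse_flag("0xC"), parse_flag("014"), parse_flag("12"), parse_flag("UNMAP,MUNMAP"))
# 12 12 12 12
print(passes(99, require=parse_flag("PAIRED,READ1")))   # True
print(passes(1123, exclude=parse_flag("DUP")))          # False
```

Under these definitions every FLAG value is accepted by exactly one of `require=12` and `exclude_all=12`, which is the sense in which -G is the opposite of -f.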

Processing options

  • --add-flags FLAG: Adds flag(s) to read. FLAG can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with `0' (i.e. /^0[0-7]+/), as a decimal number not beginning with '0' or as a comma-separated list of flag names.
  • --remove-flags FLAG: Remove flag(s) from read. FLAG is specified in the same way as with the --add-flags option.
  • -x, --remove-tag STR: Read tag(s) to exclude from output (repeatable) [null]. This can be a single tag or a comma separated list. Alternatively the option itself can be repeated multiple times. If the list starts with a `^' then it is negated and treated as a request to remove all tags except those in STR. The list may be empty, so -x ^ will remove all tags. Note that tags will only be removed from reads that pass filtering.
  • --keep-tag STR: This keeps only tags listed in STR and is directly equivalent to --remove-tag ^STR. Specifying an empty list will remove all tags. If both --keep-tag and --remove-tag are specified then --keep-tag has precedence. Note that tags will only be removed from reads that pass filtering.
  • -B, --remove-B: Collapse the backward CIGAR operation.
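
The relationship between --remove-tag, the `^` negation and --keep-tag can be illustrated with a small model. This is only a sketch of the rules as worded above; `remove_tags` and `keep_tags` are hypothetical helpers operating on a plain dictionary of tags, not samtools code.

```python
def parse_tag_list(spec):
    """Split a list such as 'NM,MD' or '^RG' into (negated, set_of_tags)."""
    negated = spec.startswith("^")
    body = spec[1:] if negated else spec
    tags = {tag for tag in body.split(",") if tag}
    return negated, tags

def remove_tags(record_tags, spec):
    """Model of -x / --remove-tag: drop listed tags, or all others if negated."""
    negated, tags = parse_tag_list(spec)
    if negated:
        return {k: v for k, v in record_tags.items() if k in tags}
    return {k: v for k, v in record_tags.items() if k not in tags}

def keep_tags(record_tags, spec):
    """Model of --keep-tag STR, described as equivalent to --remove-tag ^STR."""
    return remove_tags(record_tags, "^" + spec)

tags = {"NM": 1, "MD": "10", "RG": "grp2"}
print(remove_tags(tags, "NM,MD"))  # {'RG': 'grp2'}
print(remove_tags(tags, "^"))      # {}
print(keep_tags(tags, "RG"))       # {'RG': 'grp2'}
```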

General options

  • -?, --help: Print long help, including note about region specification.
  • -S: Ignored (input format is auto-detected).
  • --no-PG: Do not add a PG line.
  • --input-fmt-option OPT[=VAL]: Specify a single input file format option in the form of OPTION or OPTION=VALUE.
  • -O, --output-fmt FORMAT[,OPT[=VAL]]...: Specify output format (SAM, BAM, CRAM).
  • --output-fmt-option OPT[=VAL]: Specify a single output file format option in the form of OPTION or OPTION=VALUE.
  • -T, --reference FILE: A FASTA format reference FILE, optionally compressed by bgzip and ideally indexed by samtools faidx. If an index is not present one will be generated for you, if the reference file is local. If the reference file is not local, but is accessed instead via an https://, s3:// or other URL, the index file will need to be supplied by the server alongside the reference. It is possible to have the reference and index files in different locations by supplying both to this option separated by the string "##idx##", for example: -T ftp://x.com/ref.fa##idx##ftp://y.com/index.fa.fai. However, note that only the location of the reference will be stored in the output file header. If this method is used to make CRAM files, the cram reader may not be able to find the index, and may not be able to decode the file unless it can get the references it needs using a different method. One of -t or -T options is required when SAM input does not contain @SQ headers, and the -T option is required whenever writing CRAM output.
  • -@, --threads INT: Number of additional threads to use [0].
  • --write-index: Automatically index the output files [off].
  • -P, --fetch-pairs: Retrieve pairs even when the mate is outside of the requested region. Enabling this option also turns on the multi-region iterator (-M). A region to search must be specified, either on the command-line, or using the -L option. The input file must be an indexed regular file. This option first scans the requested region, using the RNEXT and PNEXT fields of the records that have the PAIRED flag set and pass other filtering options to find where paired reads are located. These locations are used to build an expanded region list, and a set of QNAMEs to allow from the new regions. It will then make a second pass, collecting all reads from the originally-specified region list together with reads from additional locations that match the allowed set of QNAMEs. Any other filtering options used will be applied to all reads found during this second pass. As this option links reads using RNEXT and PNEXT, it is important that these fields are set accurately. Use 'samtools fixmate' to correct them if necessary. Note that this option does not work with the -c, -U, or -p options.
  • -z FLAGs, --sanitize FLAGs: Perform some sanity checks on the state of SAM record fields, fixing up common mistakes made by aligners. These include soft-clipping alignments when they extend beyond the end of the reference, marking records as unmapped when they have reference * or position 0, and ensuring unmapped alignments have no CIGAR or mapping quality and no MD, NM, CG or SM tags. FLAGs is a comma-separated list of keywords chosen from the following list.
    • unmap: The UNMAPPED BAM flag. This is set for reads with position <= 0, reference name "*" or reads starting beyond the end of the reference. Note CIGAR "*" is permitted for mapped data so does not trigger this.
    • pos: Position and reference name fields. These may be cleared when a sequence is unmapped due to the coordinates being beyond the end of the reference. Selecting this may change the sort order of the file, so it is not a part of the "on" compound argument.
    • mqual: Mapping quality. This is set to zero for unmapped reads.
    • cigar: Modifies CIGAR fields, either by adding soft-clips for reads that overlap the end of the reference or by clearing it for unmapped reads.
    • aux: For unmapped data, some auxiliary fields are meaningless and will be removed. These include NM, MD, CG and SM.
    • off: Perform no sanity fixing. This is the default.
    • on: Sanitize data in a way that guarantees the same sort order. This is everything except for pos.
    • all: All sanitizing options, including pos.
  • --verbosity INT: Set level of verbosity.
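
The compound keywords accepted by -z / --sanitize can be read as shorthand for sets of the individual keywords. The sketch below expands a FLAGs list according to the descriptions above; `expand_sanitize` is a hypothetical helper, and letting "off" simply contribute nothing when mixed with other keywords is an assumption, since the text does not say how such a mix is handled.

```python
BASE_KEYWORDS = {"unmap", "pos", "mqual", "cigar", "aux"}

def expand_sanitize(flags):
    """Expand a comma-separated sanitize list into individual keywords."""
    selected = set()
    for keyword in flags.split(","):
        keyword = keyword.strip()
        if keyword == "off":
            continue                            # assumption: adds nothing
        elif keyword == "on":
            selected |= BASE_KEYWORDS - {"pos"}  # everything except pos
        elif keyword == "all":
            selected |= BASE_KEYWORDS            # everything, including pos
        elif keyword in BASE_KEYWORDS:
            selected.add(keyword)
        else:
            raise ValueError(f"unknown sanitize keyword: {keyword!r}")
    return selected

print(sorted(expand_sanitize("on")))         # ['aux', 'cigar', 'mqual', 'unmap']
print(sorted(expand_sanitize("all")))        # ['aux', 'cigar', 'mqual', 'pos', 'unmap']
print(sorted(expand_sanitize("cigar,aux")))  # ['aux', 'cigar']
```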

Examples

Import SAM to BAM when @SQ lines are present in the header:

samtools view -b -o aln.bam aln.sam

If @SQ lines are absent:

samtools faidx ref.fa
samtools view -b -t ref.fa.fai -o aln.bam aln.sam

where ref.fa.fai is generated automatically by the faidx command.
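
For readers who want to check what -t will see, the file layout described under the -t option (tab-delimited, reference name in column one, length in column two, extra columns ignored, line order defining reference order) can be read as follows. This is a minimal sketch; `read_reference_lengths` is a hypothetical helper and the sample lines are made up for illustration.

```python
import io

def read_reference_lengths(handle):
    """Return [(name, length), ...] in file order from a .fai-style file."""
    references = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        # Only the first two columns matter; any others are ignored.
        references.append((fields[0], int(fields[1])))
    return references

# Illustrative content only; real .fai files carry offsets in later columns.
example = io.StringIO("chr1\t1000\t6\t60\t61\nchrM\t250\t1030\t60\t61\n")
print(read_reference_lengths(example))
# [('chr1', 1000), ('chrM', 250)]
```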

Convert a BAM file to a CRAM file using a local reference sequence.

samtools view -C -T ref.fa -o aln.cram aln.bam

Convert a BAM file to a CRAM with NM and MD tags stored verbatim rather than calculating on the fly during CRAM decode, so that mixed data sets with MD/NM only on some records, or NM calculated using different definitions of mismatch, can be decoded without change. The second command demonstrates how to decode such a file. The request to not decode MD here is turning off auto-generation of both MD and NM; it will still emit the MD/NM tags on records that had these stored verbatim.

samtools view -C --output-fmt-option store_md=1 --output-fmt-option store_nm=1 -o aln.cram aln.bam
samtools view --input-fmt-option decode_md=0 -o aln.new.bam aln.cram

An alternative way of achieving the above is listing multiple options after the --output-fmt or -O option. The commands below are equivalent to the two above.

samtools view -O cram,store_md=1,store_nm=1 -o aln.cram aln.bam
samtools view --input-fmt cram,decode_md=0 -o aln.new.bam aln.cram

Include customized index file as a part of arguments.

samtools view [options] -X /data_folder/data.bam /index_folder/data.bai chrM:1-10

Output alignments in read group grp2 (records with no RG tag will also be in the output).

samtools view -r grp2 -o /data_folder/data.rg2.bam /data_folder/data.bam

Only keep reads with tag BC and where the barcode matches the barcodes listed in the barcode file.

samtools view -D BC:barcodes.txt -o /data_folder/data.barcodes.bam /data_folder/data.bam

Only keep reads with tag RG and read group grp2. This does almost the same as -r grp2 but will not keep records without the RG tag.

samtools view -d RG:grp2 -o /data_folder/data.rg2_only.bam /data_folder/data.bam

Remove the actions of samtools markdup. Clear the duplicate flag and remove the dt tag, keep the header.

samtools view -h --remove-flags DUP -x dt -o /data_folder/dat.no_dup_markings.bam /data_folder/data.bam

Filter expressions

Filter expressions provide on-the-fly checking of incoming SAM, BAM or CRAM records, discarding records that do not match the specified expression. The language used is primarily C style, but with a few differences in the precedence rules for bit operators and the inclusion of regular expression matching. The operator precedence, from strongest binding to weakest, is:

Grouping        (, )              E.g. "(1+2)*3"
Values:         literals, vars    Numbers, strings and variables
Unary ops:      +, -, !, ~        E.g. -10 +10, !10 (not), ~5 (bit not)
Math ops:       *, /, %           Multiply, division and (integer) modulo
Math ops:       +, -              Addition / subtraction
Bit-wise:       &                 Integer AND
Bit-wise:       ^                 Integer XOR
Bit-wise:       |                 Integer OR
Conditionals:   >, >=, <, <=
Equality:       ==, !=, =~, !~    =~ and !~ match regular expressions
Boolean:        &&, ||            Logical AND / OR

Expressions are computed using floating point mathematics, so "10 / 4" evaluates to 2.5 rather than 2. Numbers may be written as integers in decimal or "0x" plus hexadecimal, and as floating point with or without exponents. However, operations that require integers first do an implicit type conversion, so "7.9 % 5" is 2 and "7.9 & 4.1" is equivalent to "7 & 4", which is 4. Strings are always specified using double quotes. To get a double quote in a string, use backslash. Similarly a double backslash is used to get a literal backslash. For example ab\"c\\d is the string ab"c\d.

Comparison operators are evaluated as a match being 1 and a mismatch being 0, thus "(2 > 1) + (3 < 5)" evaluates as 2. All comparisons involving undefined (null) values are deemed to be false.
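
The numeric rules in the two paragraphs above can be mimicked in Python to check one's intuition. This is a rough model, not the htslib evaluator: the helper names are hypothetical, `None` stands in for an undefined (null) value, and only non-negative operands are considered.

```python
def modulo(a, b):
    """'%' needs integers, so operands are truncated first."""
    return int(a) % int(b)

def bit_and(a, b):
    """'&' needs integers, so operands are truncated first."""
    return int(a) & int(b)

def greater_than(a, b):
    """Comparisons give 1 or 0; any undefined (None) operand gives 0."""
    if a is None or b is None:
        return 0
    return 1 if a > b else 0

print(10 / 4)                                    # 2.5
print(modulo(7.9, 5))                            # 2
print(bit_and(7.9, 4.1))                         # 4
print(greater_than(2, 1) + greater_than(5, 3))   # 2, as in "(2 > 1) + (3 < 5)"
print(greater_than(None, 1))                     # 0
```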

Variables give the expression access to the file format specifics. They correspond to SAM fields; for example, to find paired alignments with high mapping quality and a very large insert size, we may use the expression "mapq >= 30 && (tlen >= 100000 || tlen <= -100000)". Valid variable names and their data types are:

endpos               int            Alignment end position (1-based)
flag                 int            Combined FLAG field
flag.paired          int            Single bit, 0 or 1
flag.proper_pair     int            Single bit, 0 or 2
flag.unmap           int            Single bit, 0 or 4
flag.munmap          int            Single bit, 0 or 8
flag.reverse         int            Single bit, 0 or 16
flag.mreverse        int            Single bit, 0 or 32
flag.read1           int            Single bit, 0 or 64
flag.read2           int            Single bit, 0 or 128
flag.secondary       int            Single bit, 0 or 256
flag.qcfail          int            Single bit, 0 or 512
flag.dup             int            Single bit, 0 or 1024
flag.supplementary   int            Single bit, 0 or 2048
library              string         Library (LB header via RG)
mapq                 int            Mapping quality
mpos                 int            Synonym for pnext
mrefid               int            Mate reference number (0 based)
mrname               string         Synonym for rnext
ncigar               int            Number of cigar operations
pnext                int            Mate's alignment position (1-based)
pos                  int            Alignment position (1-based)
qlen                 int            Alignment length: no. query bases
qname                string         Query name
qual                 string         Quality values (raw, 0 based)
refid                int            Integer reference number (0 based)
rlen                 int            Alignment length: no. reference bases
rname                string         Reference name
rnext                string         Mate's reference name
sclen                int            Number of soft-clipped bases
seq                  string         Sequence
tlen                 int            Template length (insert size)
[XX]                 int / string   XX tag value

Flags are returned either as the whole flag value or by checking for a single bit. Hence the filter expression flag.dup is equivalent to flag & 1024.
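
The equivalence between a single-bit variable and a mask on the whole flag can be checked with a short model. The bit values come from the table above; `flag_field` is a hypothetical helper, not part of samtools.

```python
# Bit values for the flag.* variables, as listed in the variable table.
FLAG_BITS = {
    "paired": 1, "proper_pair": 2, "unmap": 4, "munmap": 8,
    "reverse": 16, "mreverse": 32, "read1": 64, "read2": 128,
    "secondary": 256, "qcfail": 512, "dup": 1024, "supplementary": 2048,
}

def flag_field(flag, name):
    """Model of flag.<name>: 0, or the bit value itself when the bit is set."""
    return flag & FLAG_BITS[name]

flag = 1123  # 1024 + 64 + 32 + 2 + 1
print(flag_field(flag, "dup"))          # 1024, same as flag & 1024
print(flag_field(flag, "proper_pair"))  # 2
print(flag_field(99, "dup"))            # 0
```

Because the result is either zero or non-zero, it can be used directly as a condition in an expression.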

qlen and rlen are measured using the CIGAR string to count the number of query (sequence) and reference bases consumed. Note qlen may not exactly match the length of the seq field if the sequence is "*". sclen is the number of soft-clipped bases. When combined in qlen-sclen it can give the number of sequence bases used in the alignment, distinguishing between global alignment and local alignment length.

endpos is the (1-based inclusive) position of the rightmost mapped base of the read, as measured using the CIGAR string, and for mapped reads is equivalent to pos+rlen-1. For unmapped reads, it is the same as pos.
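
A worked example may help relate qlen, rlen, sclen and endpos to a CIGAR string. The sketch below assumes the usual SAM convention that M, I, S, = and X consume query bases while M, D, N, = and X consume reference bases; `cigar_lengths` and `endpos` are hypothetical helpers written for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")   # operations that consume query (sequence) bases
REF_OPS = set("MDN=X")     # operations that consume reference bases

def cigar_lengths(cigar):
    """Return (qlen, rlen, sclen) counted from a CIGAR string."""
    qlen = rlen = sclen = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in QUERY_OPS:
            qlen += n
        if op in REF_OPS:
            rlen += n
        if op == "S":
            sclen += n
    return qlen, rlen, sclen

def endpos(pos, cigar, mapped=True):
    """pos + rlen - 1 for mapped reads, pos itself for unmapped reads."""
    _, rlen, _ = cigar_lengths(cigar)
    return pos + rlen - 1 if mapped else pos

qlen, rlen, sclen = cigar_lengths("5S20M2I10M3D8M")
print(qlen, rlen, sclen)               # 45 41 5
print(qlen - sclen)                    # 40 query bases used in the alignment
print(endpos(100, "5S20M2I10M3D8M"))   # 140
```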

Reference names may be matched either by their string forms (rname and mrname) or as the Nth @SQ line (counting from zero) as stored in BAM using refid and mrefid respectively.

Auxiliary tags are described in square brackets and these expand to either integer or string as defined by the tag itself (XX:Z:string or XX:i:int). For example [NM]>=10 can be used to look for alignments with many mismatches and [RG]=~"grp[ABC]-" will match the read-group string.

If no comparison is used with an auxiliary tag it is taken simply to be a test for the existence of that tag. So [NM] will return any record containing an NM tag, even if that tag is zero (NM:i:0). In htslib <= 1.15 negating this with ![NM] gave misleading results as it was true if the tag did not exist or did exist but was zero. Now this is strictly does-not-exist. An explicit exists([NM]) and !exists([NM]) function has also been added to make this intention clear.

Similarly in htslib <= 1.15 using [NM]!=0 was true both when the tag existed and was not zero as well as when the tag did not exist. From 1.16 onwards all comparison operators are only true for tags that exist, so [NM]!=0 works as expected.
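
The existence rules for auxiliary tags from 1.16 onwards can be summarised with a toy model in which a record's tags are a dictionary. This is only an illustration of the behaviour described above; `exists` and `tag_not_equal` are hypothetical helpers.

```python
def exists(tags, name):
    """Model of exists([XX]): true whenever the tag is present, even if zero."""
    return name in tags

def tag_not_equal(tags, name, value):
    """Model of [XX] != value: only true for tags that exist."""
    return name in tags and tags[name] != value

records = [{"NM": 0}, {"NM": 3}, {}]

print([exists(r, "NM") for r in records])            # [True, True, False]
print([not exists(r, "NM") for r in records])        # [False, False, True]
print([tag_not_equal(r, "NM", 0) for r in records])  # [False, True, False]
```

The third record shows the difference from the older behaviour: with no NM tag present, the != comparison is false rather than true.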

Some simple functions are available to operate on strings. These treat the strings as arrays of bytes, permitting their length, minimum, maximum and average values to be computed. These are useful for processing Quality Scores.

length(x)   Length of the string (excluding nul char)
min(x)      Minimum byte value in the string
max(x)      Maximum byte value in the string
avg(x)      Average byte value in the string

Note that "avg" is a floating point value and it may be NAN for empty strings. This means that "avg(qual)" does not produce an error for records that have both seq and qual of "*". NAN values will fail any conditional checks, so e.g. "avg(qual) > 20" works and will not report these records. NAN also fails all equality, < and > comparisons, and returns zero when given as an argument to the exists function. It can be negated with !x in which case it becomes true.
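
The NAN behaviour for empty strings mirrors ordinary IEEE floating point rules, which can be reproduced in Python. This sketch models avg over raw byte values only; the function `avg` here is a stand-in written for illustration, and the quality values are made up.

```python
import math

def avg(values):
    """Average byte value, or NaN for an empty string."""
    if len(values) == 0:
        return math.nan
    return sum(values) / len(values)

qual = bytes([30, 32, 28, 10])   # raw quality values, not ASCII-encoded
print(avg(qual))                 # 25.0
print(avg(qual) > 20)            # True
print(avg(b"") > 20)             # False: NaN fails the comparison
print(avg(b"") == avg(b""))      # False: NaN also fails equality
```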

Functions that operate on both strings and numerics:

exists(x)      True if the value exists (or is explicitly true).
default(x,d)   Value x if it exists or d if not.

Functions that apply only to numeric values:

sqrt(x)     Square root of x
log(x)      Natural logarithm of x
pow(x, y)   Power function, x to the power of y
exp(x)      Base-e exponential, equivalent to pow(e,x)

File formats this tool works with

BAM, CRAM, SAM