View, convert format, or filter (with different criteria) alignments.
samtools view [options] in.sam|in.bam|in.cram [region...]
With no options or regions specified, this command prints all alignments in the specified input alignment file (in SAM, BAM, or CRAM format) to standard output in SAM format (with no header).
You may specify one or more space-separated region specifications after the input filename to restrict output to only those alignments which overlap the specified region(s). Use of region specifications requires a coordinate-sorted and indexed input file (in BAM or CRAM format).
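The overlap rule for region specifications can be pictured with a small sketch. This is an illustration only, not how samtools implements region queries (it uses the index). The helper names `parse_region` and `overlaps` are hypothetical, and the parser handles only the `name:start-end` form, with 1-based inclusive coordinates assumed.

```python
def parse_region(region):
    """Split 'name:start-end' into (name, start, end), 1-based inclusive.

    Hypothetical helper: only the full 'name:start-end' form is handled.
    """
    name, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    return name, int(start), int(end)


def overlaps(rname, pos, endpos, region):
    """True if the alignment [pos, endpos] shares at least one base with region."""
    name, start, end = parse_region(region)
    return rname == name and pos <= end and endpos >= start


print(overlaps("chrM", 8, 57, "chrM:1-10"))   # True: bases 8-10 are shared
print(overlaps("chrM", 11, 60, "chrM:1-10"))  # False: starts after the region
```

An alignment only has to touch the region to be reported; it does not need to lie entirely inside it.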
The -X option lets users specify custom index file location(s) when the data folder does not contain an index file. See the EXAMPLES section for sample usage.
If you want to change the output format from the default of headerless SAM:
If you want to set the output file name(s):
samtools faidx <ref.fa>, the resulting index file <ref.fa>.fai can be used as this FILE. One of -t or -T options is required when SAM input does not contain @SQ headers, and the -T option is required whenever writing CRAM output.
-L FILE or --use-index --target-file FILE.
Filter the alignments that will be included in the output to only those alignments that match certain criteria.
-d XX:42 to select alignments with an XX:i:42 field, not -d XX:i:42.
--subsample 0.FRAC and --subsample-seed INT.
-x ^ will remove all tags. Note that tags will only be removed from reads that pass filtering.
--remove-tag ^STR. Specifying an empty list will remove all tags. If both --keep-tag and --remove-tag are specified then --keep-tag has precedence. Note that tags will only be removed from reads that pass filtering.
samtools faidx. If an index is not present one will be generated for you, if the reference file is local. If the reference file is not local, but is accessed instead via an https://, s3:// or other URL, the index file will need to be supplied by the server alongside the reference. It is possible to have the reference and index files in different locations by supplying both to this option separated by the string "##idx##", for example: -T ftp://x.com/ref.fa##idx##ftp://y.com/index.fa.fai. However, note that only the location of the reference will be stored in the output file header. If this method is used to make CRAM files, the cram reader may not be able to find the index, and may not be able to decode the file unless it can get the references it needs using a different method. One of -t or -T options is required when SAM input does not contain @SQ headers, and the -T option is required whenever writing CRAM output.
Import SAM to BAM when @SQ lines are present in the header:
samtools view -b -o aln.bam aln.sam
If @SQ lines are absent:
samtools faidx ref.fa
samtools view -b -t ref.fa.fai -o aln.bam aln.sam
where ref.fa.fai is generated automatically by the faidx command.
Convert a BAM file to a CRAM file using a local reference sequence.
samtools view -C -T ref.fa -o aln.cram aln.bam
Convert a BAM file to a CRAM with NM and MD tags stored verbatim rather than calculating on the fly during CRAM decode, so that mixed data sets with MD/NM only on some records, or NM calculated using different definitions of mismatch, can be decoded without change. The second command demonstrates how to decode such a file. The request to not decode MD here is turning off auto-generation of both MD and NM; it will still emit the MD/NM tags on records that had these stored verbatim.
samtools view -C --output-fmt-option store_md=1 --output-fmt-option store_nm=1 -o aln.cram aln.bam
samtools view --input-fmt-option decode_md=0 -o aln.new.bam aln.cram
An alternative way of achieving the above is listing multiple options after the --output-fmt or -O option. The commands below are equivalent to the two above.
samtools view -O cram,store_md=1,store_nm=1 -o aln.cram aln.bam
samtools view --input-fmt cram,decode_md=0 -o aln.new.bam aln.cram
Supply a custom index file as part of the arguments.
samtools view [options] -X /data_folder/data.bam /index_folder/data.bai chrM:1-10
Output alignments in read group grp2 (records with no RG tag will also be in the output).
samtools view -r grp2 -o /data_folder/data.rg2.bam /data_folder/data.bam
Only keep reads that have a BC tag and where the barcode matches one of the barcodes listed in the barcode file.
samtools view -D BC:barcodes.txt -o /data_folder/data.barcodes.bam /data_folder/data.bam
Only keep reads with tag RG and read group grp2. This does almost the same as -r grp2 but will not keep records without the RG tag.
samtools view -d RG:grp2 -o /data_folder/data.rg2_only.bam /data_folder/data.bam
Undo the actions of samtools markdup: clear the duplicate flag and remove the dt tag, keeping the header.
samtools view -h --remove-flags DUP -x dt -o /data_folder/dat.no_dup_markings.bam /data_folder/data.bam
Filter expressions check incoming SAM, BAM or CRAM records on the fly, discarding records that do not match the specified expression. The language used is primarily C style, but with a few differences in the precedence rules for bit operators and the inclusion of regular expression matching. The operator precedence, from strongest binding to weakest, is:
Category | Operators | Description |
---|---|---|
Grouping: | (, ) | E.g. "(1+2)*3" |
Values: | literals, vars | Numbers, strings and variables |
Unary ops: | +, -, !, ~ | E.g. -10 +10, !10 (not), ~5 (bit not) |
Math ops: | *, /, % | Multiply, division and (integer) modulo |
Math ops: | +, - | Addition / subtraction |
Bit-wise: | & | Integer AND |
Bit-wise: | ^ | Integer XOR |
Bit-wise: | \| | Integer OR |
Conditionals: | >, >=, <, <= | |
Equality: | ==, !=, =~, !~ | =~ and !~ match regular expressions |
Boolean: | &&, \|\| | Logical AND / OR |
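To make the bit-operator precedence difference concrete, here is a small Python sketch. Python happens to share the property shown in the table that `&` binds more tightly than `==`, whereas in C `==` binds more tightly than `&`. The flag value is made up for illustration, and the C grouping is written out with explicit parentheses.

```python
flag = 1024 + 16  # hypothetical record: duplicate + reverse strand

# Grouping implied by the table above: & binds tighter than ==
table_style = (flag & 1024) == 1024

# Grouping a C compiler would apply to "flag & 1024 == 1024"
c_style = flag & (1024 == 1024)

print(table_style)  # True
print(c_style)      # 0
```

Under the table's rules, a bit test written without parentheses therefore behaves as most readers expect.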
Expressions are computed using floating point mathematics, so "10 / 4" evaluates to 2.5 rather than 2. Numbers may be written as integers in decimal or "0x" plus hexadecimal, and as floating point with or without exponents. However, operations that require integers first do an implicit type conversion, so "7.9 % 5" is 2 and "7.9 & 4.1" is equivalent to "7 & 4", which is 4. Strings are always specified using double quotes. To get a double quote in a string, use a backslash. Similarly, a double backslash is used to get a literal backslash. For example, ab\"c\\d is the string ab"c\d.
Comparison operators are evaluated as a match being 1 and a mismatch being 0, thus "(2 > 1) + (3 < 5)" evaluates as 2. All comparisons involving undefined (null) values are deemed to be false.
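The numeric rules above can be mimicked in a few lines of Python. This is a sketch of the described semantics, not the htslib implementation: the helper names are hypothetical, `None` stands in for an undefined value, and the integer conversion is assumed to truncate, as the "7.9" examples suggest.

```python
def int_mod(a, b):
    """Modulo after converting both operands to integers (assumed truncation)."""
    return int(a) % int(b)


def int_and(a, b):
    """Bit-wise AND after converting both operands to integers."""
    return int(a) & int(b)


def greater_than(a, b):
    """Comparison yielding 1 or 0; any undefined (None) operand gives 0."""
    if a is None or b is None:
        return 0
    return 1 if a > b else 0


print(10 / 4)                                   # 2.5
print(int_mod(7.9, 5))                          # 2
print(int_and(7.9, 4.1))                        # 4
print(greater_than(2, 1) + greater_than(5, 3))  # 2, same as (2 > 1) + (3 < 5)
print(greater_than(None, 1))                    # 0
```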
The variables are how the expression accesses file format specifics. They correspond to SAM fields. For example, to find paired alignments with high mapping quality and a very large insert size, we may use the expression "mapq >= 30 && (tlen >= 100000 || tlen <= -100000)". Valid variable names and their data types are:
Variable | Type | Description |
---|---|---|
endpos | int | Alignment end position (1-based) |
flag | int | Combined FLAG field |
flag.paired | int | Single bit, 0 or 1 |
flag.proper_pair | int | Single bit, 0 or 2 |
flag.unmap | int | Single bit, 0 or 4 |
flag.munmap | int | Single bit, 0 or 8 |
flag.reverse | int | Single bit, 0 or 16 |
flag.mreverse | int | Single bit, 0 or 32 |
flag.read1 | int | Single bit, 0 or 64 |
flag.read2 | int | Single bit, 0 or 128 |
flag.secondary | int | Single bit, 0 or 256 |
flag.qcfail | int | Single bit, 0 or 512 |
flag.dup | int | Single bit, 0 or 1024 |
flag.supplementary | int | Single bit, 0 or 2048 |
library | string | Library (LB header via RG) |
mapq | int | Mapping quality |
mpos | int | Synonym for pnext |
mrefid | int | Mate reference number (0 based) |
mrname | string | Synonym for rnext |
ncigar | int | Number of cigar operations |
pnext | int | Mate's alignment position (1-based) |
pos | int | Alignment position (1-based) |
qlen | int | Alignment length: no. query bases |
qname | string | Query name |
qual | string | Quality values (raw, 0 based) |
refid | int | Integer reference number (0 based) |
rlen | int | Alignment length: no. reference bases |
rname | string | Reference name |
rnext | string | Mate's reference name |
sclen | int | Number of soft-clipped bases |
seq | string | Sequence |
tlen | int | Template length (insert size) |
[XX] | int / string | XX tag value |
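As an illustration of how the variables feed an expression, the example filter quoted before the table, "mapq >= 30 && (tlen >= 100000 || tlen <= -100000)", can be written as a Python predicate. The records below are invented dictionaries keyed by variable name, not real alignments.

```python
def is_wide_high_quality(rec):
    """Python rendering of: mapq >= 30 && (tlen >= 100000 || tlen <= -100000)."""
    return rec["mapq"] >= 30 and (rec["tlen"] >= 100000 or rec["tlen"] <= -100000)


# Hypothetical records holding only the fields the expression reads
records = [
    {"qname": "r1", "mapq": 60, "tlen": 150000},
    {"qname": "r2", "mapq": 60, "tlen": 350},
    {"qname": "r3", "mapq": 10, "tlen": -200000},
    {"qname": "r4", "mapq": 42, "tlen": -120000},
]

kept = [r["qname"] for r in records if is_wide_high_quality(r)]
print(kept)  # ['r1', 'r4']
```

Records r2 and r3 are discarded because each fails one half of the condition.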
Flags are returned either as the whole flag value or by checking for a single bit. Hence the filter expression flag.dup is equivalent to flag & 1024.
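The relationship between the combined flag and the single-bit variables can be sketched as follows. The bit values are taken from the table above. The helper `flag_field` and the sample flag are hypothetical.

```python
# Bit values as listed in the variable table
FLAG_BITS = {
    "paired": 1, "proper_pair": 2, "unmap": 4, "munmap": 8,
    "reverse": 16, "mreverse": 32, "read1": 64, "read2": 128,
    "secondary": 256, "qcfail": 512, "dup": 1024, "supplementary": 2048,
}


def flag_field(flag, name):
    """Value of flag.<name>: 0 if the bit is clear, otherwise the bit's own value."""
    return flag & FLAG_BITS[name]


flag = 1123  # hypothetical: 1024 + 64 + 32 + 2 + 1

print(flag_field(flag, "dup"))      # 1024, same as flag & 1024
print(flag_field(flag, "reverse"))  # 0
print([n for n in FLAG_BITS if flag_field(flag, n)])
# ['paired', 'proper_pair', 'mreverse', 'read1', 'dup']
```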
qlen and rlen are measured using the CIGAR string to count the number of query (sequence) and reference bases consumed. Note qlen may not exactly match the length of the seq field if the sequence is "*". sclen is the number of soft-clipped bases. When combined in qlen-sclen it can give the number of sequence bases used in the alignment, distinguishing between global alignment and local alignment length.
endpos is the (1-based inclusive) position of the rightmost mapped base of the read, as measured using the CIGAR string, and for mapped reads is equivalent to pos+rlen-1. For unmapped reads, it is the same as pos.
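A rough sketch of how qlen, rlen, sclen and endpos relate to a CIGAR string is given below. It assumes the usual SAM convention for which operations consume query bases (M, I, S, =, X) and reference bases (M, D, N, =, X). The function names and the `mapped` parameter are hypothetical, and the CIGAR string is invented.

```python
import re

QUERY_OPS = set("MIS=X")  # operations that consume query (sequence) bases
REF_OPS = set("MDN=X")    # operations that consume reference bases


def cigar_lengths(cigar):
    """Return (qlen, rlen, sclen) counted from a CIGAR string."""
    qlen = rlen = sclen = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in QUERY_OPS:
            qlen += n
        if op in REF_OPS:
            rlen += n
        if op == "S":
            sclen += n
    return qlen, rlen, sclen


def endpos(pos, rlen, mapped=True):
    """Rightmost mapped base (1-based inclusive); equals pos for unmapped reads."""
    return pos + rlen - 1 if mapped else pos


qlen, rlen, sclen = cigar_lengths("5S90M2D5M")
print(qlen, rlen, sclen)   # 100 97 5
print(qlen - sclen)        # 95 sequence bases used in the alignment
print(endpos(100, rlen))   # 196
```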
Reference names may be matched either by their string forms (rname and mrname) or as the Nth @SQ line (counting from zero) as stored in BAM, using the refid and mrefid variables listed above respectively.
Auxiliary tags are described in square brackets and these expand to either integer or string as defined by the tag itself (XX:Z:string or XX:i:int). For example [NM]>=10 can be used to look for alignments with many mismatches and [RG]=~"grp[ABC]-" will match the read-group string.
If no comparison is used with an auxiliary tag, it is taken simply to be a test for the existence of that tag. So [NM] will return any record containing an NM tag, even if that tag is zero (NM:i:0). In htslib <= 1.15, negating this with ![NM] gave misleading results, as it was true if the tag did not exist or if it existed but was zero. From 1.16 onwards, ![NM] is true only when the tag does not exist. Explicit exists([NM]) and !exists([NM]) forms have also been added to make this intention clear.
Similarly in htslib <= 1.15 using [NM]!=0 was true both when the tag existed and was not zero as well as when the tag did not exist. From 1.16 onwards all comparison operators are only true for tags that exist, so [NM]!=0 works as expected.
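The tag rules described for htslib 1.16 onwards can be modelled with dictionaries standing in for a record's auxiliary tags. This is a sketch of the stated behaviour only. `exists` and `tag_not_equal` are hypothetical helpers, and the records are invented.

```python
def exists(tags, name):
    """[XX] used on its own: true whenever the tag is present, even if zero."""
    return name in tags


def tag_not_equal(tags, name, value):
    """[XX] != value: true only for tags that exist and differ from value."""
    return name in tags and tags[name] != value


nm_zero = {"NM": 0}
nm_three = {"NM": 3}
no_nm = {"RG": "grp2"}

print(exists(nm_zero, "NM"))            # True: NM:i:0 still counts as present
print(not exists(no_nm, "NM"))          # True: ![NM] means the tag is absent
print(tag_not_equal(nm_three, "NM", 0)) # True
print(tag_not_equal(nm_zero, "NM", 0))  # False
print(tag_not_equal(no_nm, "NM", 0))    # False: missing tags fail comparisons
```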
Some simple functions are available to operate on strings. These treat the strings as arrays of bytes, permitting their length, minimum, maximum and average values to be computed. These are useful for processing Quality Scores.
Function | Description |
---|---|
length(x) | Length of the string (excluding nul char) |
min(x) | Minimum byte value in the string |
max(x) | Maximum byte value in the string |
avg(x) | Average byte value in the string |
Note that "avg" is a floating point value and it may be NAN for empty strings. This means that "avg(qual)" does not produce an error for records that have both seq and qual of "*". NAN values will fail any conditional checks, so e.g. "avg(qual) > 20" works and will not report these records. NAN also fails all equality, < and > comparisons, and returns zero when given as an argument to the exists function. It can be negated with !x in which case it becomes true.
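The NAN behaviour can be reproduced in Python, whose floating point NaN also fails ordering and equality comparisons. This is a sketch under the assumption that a missing quality string behaves like an empty list of values. The `avg` helper and the sample qualities are hypothetical.

```python
import math


def avg(values):
    """Mean of the values, or NaN for an empty input (no error raised)."""
    return sum(values) / len(values) if values else math.nan


good = [30, 32, 35, 28]  # hypothetical raw quality values
missing = []             # stands in for a record whose qual is "*"

print(avg(good))                 # 31.25
print(avg(good) > 20)            # True
print(avg(missing) > 20)         # False: NaN fails the check, record is dropped
print(avg(missing) == avg(missing))  # False: NaN fails equality too
print(math.isnan(avg(missing)))  # True
```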
Functions that operate on both strings and numerics:
Function | Description |
---|---|
exists(x) | True if the value exists (or is explicitly true). |
default(x,d) | Value x if it exists or d if not. |
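A minimal model of default(x,d), again using a dictionary in place of a record's tags. The helper name mirrors the function in the table, but the signature taking a `tags` dictionary is an assumption made for this sketch, as are the sample records and the threshold of 2.

```python
def default(tags, name, fallback):
    """Value of the tag if it exists, otherwise the fallback."""
    return tags[name] if name in tags else fallback


# Hypothetical records: treat a missing NM tag as zero mismatches
records = [{"NM": 0}, {"NM": 5}, {}]

print([default(r, "NM", 0) for r in records])       # [0, 5, 0]
print([default(r, "NM", 0) <= 2 for r in records])  # [True, False, True]
```

Supplying a fallback lets records without the tag take part in a comparison instead of failing it outright.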
Functions that apply only to numeric values:
Function | Description |
---|---|
sqrt(x) | Square root of x |
log(x) | Natural logarithm of x |
pow(x, y) | Power function, x to the power of y |
exp(x) | Base-e exponential, equivalent to pow(e,x) |