bcftools consensus manual with usage examples

Usage

bcftools consensus [OPTIONS] <file.vcf.gz>

Manual

bcftools consensus is a command in the BCFtools suite.

Sometimes there is the need to create a consensus sequence for an individual where the sequence incorporates variants typed for this individual (via the --samples option). This is possible using the consensus command.

We need the reference sequence reference.fa in the fasta format and an indexed VCF with the variants calls.bcf. The command is:

$ cat reference.fa | bcftools consensus calls.vcf.gz > consensus.fa

In order to keep the program relatively simple, consensus merely applies variants from the VCF regardless of AF or other properties, and symbolic alleles other than <DEL>, <*> or <NON_REF> are currently not supported, e.g. <INS>. If more more advanced logic is required, the VCF has to be preprocessed with other tools like FastaAlternateReferenceMaker from the GATK suite.

Required arguments

file.vcf.gz: Input VCF file

Options

-c, --chain FILE: Write a chain file for liftover
-a, --absent CHAR: Replace positions absent from VCF with CHAR
-e, --exclude EXPR: Exclude sites for which the EXPR is true
-f, --fasta-ref FILE: Reference sequence in fasta format. You can also specify the reference sequence via pipe, in this case, you don't need to specify the -f option.
-H, --haplotype WHICH: Choose which allele to use from the FORMAT/GT field, note the codes are case-insensitive:
- N: N={1,2,3,...} is the index of the allele from GT, regardless of phasing (e.g. "2", the second allele, regardless of phasing)
- R: REF allele in heterozygous genotypes
- A: ALT allele in heterozygous genotypes
- I: IUPAC code for all genotypes (automatically triggers -i)
- LR, LA: the longer allele. If both have the same length, use the REF allele (LR), or the ALT allele (LA)
- SR, SA: the shorter allele. If both have the same length, use the REF allele (SR), or the ALT allele (SA)
- 1pIu, 2pIu: first/second allele for phased genotypes and IUPAC code for unphased genotypes. This option requires -s, unless exactly one sample is present in the VCF
-i, --include EXPR: Select sites for which the expression is true (see man page for details)
-I, --iupac-codes: output variants in the form of IUPAC ambiguity codes determined from FORMAT/GT fields. By default all samples are used and can be subset with -s, --samples and -S, --samples-file. Use -s - to ignore samples and use only the REF and ALT columns. NOTE: prior to version 1.17 the IUPAC codes were determined solely from REF,ALT columns and sample genotypes were not considered.
--mark-del CHAR: Instead of removing sequence, insert character CHAR for deletions
--mark-ins uc|lc|CHAR: Highlight insertions in uppercase (uc), lowercase (lc), or use CHAR, leaving the rest as is
--mark-snv uc|lc|CHAR: Highlight substitutions in uppercase (uc), lowercase (lc), or use CHAR, leaving the rest as is
-m, --mask FILE: Replace regions according to the next --mask-with option. The default is --mask-with N
--mask-with CHAR|uc|lc: Replace with CHAR (skips overlapping variants); change to uppercase (uc) or lowercase (lc)
-M, --missing CHAR: Output CHAR instead of skipping a missing genotype "./."
-o, --output FILE: Write output to a file. By default, output will be directed to standard output.
-p, --prefix STRING: Prefix to add to output sequence names
-s, --samples LIST: Comma-separated list of samples to include, "-" to ignore samples and use REF,ALT
-S, --samples-file FILE: File of samples to include

Examples

Apply variants present in sample "NA001"

The following example will apply variants present in sample "NA001", output IUPAC codes for heterogeneity

$ bcftools consensus -s NA001 -f in.fa in.vcf.gz > out.fa

Output IUPAC ambiguity codes based on REF+ALT columns (regardless of genotype)

$ cat reference.fa | bcftools consensus --iupac-codes calls.vcf.gz > consensus.fa

Output IUPAC ambiguity codes based on sample genotypes

$ cat reference.fa | bcftools consensus --haplotype I calls.vcf.gz > consensus.fa

Get the consensus for one region

$ samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa

Appendix: Valid expressions

Valid expressions may contain

numerical constants (e.g. 1, 1.0, 1e-4), string constants (e.g. "String"), file names (e.g. @file_name, this is currently supported only to filter by the ID column)
arithmetic operators (addition +, multiplication *, subtraction -, division /, modulo %)
comparison operators: == (same as =), >, >=, <=, <, !=
regex operators "\~" and its negation "!~". The expressions are case sensitive unless "/i" is added.
parentheses: (, )
logical operators. &&, &, ||, |.
INFO tags, FORMAT tags, column names: INFO/DP or DP FORMAT/DV, FMT/DV, or DV FILTER, QUAL, ID, CHROM, POS, REF, ALT[0]

the FILTER column can be queried as follows:

FILTER="PASS"
FILTER="."
FILTER="A"          .. exact match, for example "A;B" does not pass
FILTER="A;B"        .. exact match, "A;B" and "B;A" pass, everything else fails
FILTER!="A"         .. exact match, for example "A;B" does pass
FILTER~"A"          .. subset match, for example both "A" and "A;B" pass
FILTER~"A;B"        .. subset match, pass only if both "A" and "B" are present
FILTER!~"A"         .. complement match, for example both "A" and "A;B" fail
FILTER!~"A;B"       .. complement match, fail if both "A" and "B" are present

1 (or 0) to test the presence (or absence) of a flag
```
FlagA=1 && FlagB=0
```
"." to test missing values
```
DP=".", DP!=".", ALT="."
```
missing genotypes can be matched regardless of phase and ploidy (".|.", "./.", ".", "0|.") using these expressions
```
GT="mis", GT~"\.", GT!~"\."
```
missing genotypes can be matched including the phase and ploidy (".|.", "./.", ".") using these expressions
```
GT=".|.", GT="./.", GT="."
```
sample genotype: reference (haploid or diploid), alternate (hom or het, haploid or diploid), missing genotype, homozygous, heterozygous, haploid, ref-ref hom, alt-alt hom, ref-alt het, alt-alt het, haploid ref, haploid alt (case-insensitive)
```
GT="ref"
GT="alt"
GT="mis"
GT="hom"
GT="het"
GT="hap"
GT="RR"
GT="AA"
GT="RA" or GT="AR"
GT="Aa" or GT="aA"
GT="R"
GT="A"
```
TYPE for variant type in REF,ALT columns (indel,snp,mnp,ref,bnd,other,overlap). Use the regex operator "\~" to require at least one allele of the given type or the equal sign "=" to require that all alleles are of the given type. Compare
```
TYPE="snp"
TYPE~"snp"
TYPE!="snp"
TYPE!~"snp"
```

array subscripts (0-based), "*" for any element, "-" to indicate a range. Note that for querying FORMAT vectors, the colon ":" can be used to select a sample and an element of the vector, as shown in the examples below.

INFO/AF[0] > 0.3             .. first AF value bigger than 0.3
FORMAT/AD[0:0] > 30          .. first AD value of the first sample bigger than 30
FORMAT/AD[0:1]               .. first sample, second AD value
FORMAT/AD[1:0]               .. second sample, first AD value
DP4[*] == 0                  .. any DP4 value
FORMAT/DP[0]   > 30          .. DP of the first sample bigger than 30
FORMAT/DP[1-3] > 10          .. samples 2-4
FORMAT/DP[1-]  < 7           .. all samples but the first
FORMAT/DP[0,2-4] > 20        .. samples 1, 3-5
FORMAT/AD[0:1]               .. first sample, second AD field
FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field
FORMAT/AD[*:1] or AD[:1]        .. any sample, second AD field
(DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
CSQ[*] ~ "missense_variant.*deleterious"

in addition to array subscripts shown above, it is possible to subscript arrays of Number=R tags by alleles found in FORMAT/GT (starting with version 1.17). For example

FORMAT/AD[GT] > 10        .. require support of more than 10 reads for each allele
FORMAT/AD[0:GT] > 10      .. same as above, but in the first sample
sSUM(FORMAT/AD[GT]) > 20  .. require total sample depth bigger than 20

with many samples it can be more practical to provide a file with sample names, one sample name per line
```
GT[@samples.txt]="het" & binom(AD)<0.01
```
function on FORMAT tags (over samples) and INFO tags (over vector fields): maximum; minimum; arithmetic mean (AVG is synonymous with MEAN); median; standard deviation from mean; sum; string length; absolute value; number of elements:
```
MAX, MIN, AVG, MEAN, MEDIAN, STDEV, SUM, STRLEN, ABS, COUNT
```
Note that functions above evaluate to a single value across all samples and are intended to select sites, not samples, even when applied on FORMAT tags. However, when prefixed with SMPL_ (or "s" for brevity, e.g. SMPL_MAX or sMAX), they will evaluate to a vector of per-sample values when applied on FORMAT tags:
```
SMPL_MAX, SMPL_MIN, SMPL_AVG, SMPL_MEAN, SMPL_MEDIAN, SMPL_STDEV, SMPL_SUM,
sMAX, sMIN, sAVG, sMEAN, sMEDIAN, sSTDEV, sSUM
```

two-tailed binomial test. Note that for N=0 the test evaluates to a missing value and when FORMAT/GT is used to determine the vector indices, it evaluates to 1 for homozygous genotypes.

binom(FMT/AD)                .. GT can be used to determine the correct index
binom(AD[0],AD[1])           .. or the fields can be given explicitly
phred(binom())               .. the same as binom but phred-scaled

variables calculated on the fly if not present: number of alternate alleles; number of samples; count of alternate alleles; minor allele count (similar to AC but is always smaller than 0.5); frequency of alternate alleles (AF=AC/AN); frequency of minor alleles (MAF=MAC/AN); number of alleles in called genotypes; number of samples with missing genotype; fraction of samples with missing genotype; indel length (deletions negative, insertions positive)
```
N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING, ILEN
```
the number (N_PASS) or fraction (F_PASS) of samples which pass the expression
```
N_PASS(GQ>90 & GT!="mis") > 90
F_PASS(GQ>90 & GT!="mis") > 0.9
```
custom perl filtering. Note that this command is not compiled in by default, see the section Optional Compilation with Perl in the INSTALL file for help and misc/demo-flt.pl for a working example. The demo defined the perl subroutine "severity" which can be invoked from the command line as follows:
```
perl:path/to/script.pl; perl.severity(INFO/CSQ) > 3
```

Notes

String comparisons and regular expressions are case-insensitive
Comma in strings is interpreted as a separator and when multiple values are compared, the OR logic is used. Consequently, the following two expressions are equivalent but not the third:
```
-i 'TAG="hello,world"'
-i 'TAG="hello" || TAG="world"'
-i 'TAG="hello" && TAG="world"'
```
Variables and function names are case-insensitive, but not tag names. For example, "qual" can be used instead of "QUAL", "strlen()" instead of "STRLEN()" , but not "dp" instead of "DP".

When querying multiple values, all elements are tested and the OR logic is used on the result. For example, when querying "TAG=1,2,3,4", it will be evaluated as follows:

-i 'TAG[*]=1'   .. true, the record will be printed
-i 'TAG[*]!=1'  .. true
-e 'TAG[*]=1'   .. false, the record will be discarded
-e 'TAG[*]!=1'  .. false
-i 'TAG[0]=1'   .. true
-i 'TAG[0]!=1'  .. false
-e 'TAG[0]=1'   .. false
-e 'TAG[0]!=1'  .. true

When arithmetic operators are used on vectors A and B, the following logic is used to compute the resulting vector C:
- C_i = A_i + B_i when length(A)==B(A) and sets length©=length(A)
- C_i = A_i + B_0 when length(B)=1 and sets length©=length(A)
- C_i = A_0 + B_i when length(A)=1 and sets length©=length(B)
- throw an error when length(A)!=length(B) AND length(A)!=1 AND length(B)!=1

File formats this tool works with

VCFFASTA

bcftools consensus

Category