Category

Genome Variant Analysis


Usage

bcftools consensus [OPTIONS] <file.vcf.gz>


Manual

bcftools consensus is a command in the BCFtools suite.

Sometimes there is the need to create a consensus sequence for an individual where the sequence incorporates variants typed for this individual (via the --samples option). This is possible using the consensus command.

We need the reference sequence reference.fa in the fasta format and an indexed VCF with the variants calls.bcf. The command is:

$ cat reference.fa | bcftools consensus calls.vcf.gz > consensus.fa

In order to keep the program relatively simple, consensus merely applies variants from the VCF regardless of AF or other properties, and symbolic alleles other than <DEL>, <*> or <NON_REF> are currently not supported, e.g. <INS>. If more more advanced logic is required, the VCF has to be preprocessed with other tools like FastaAlternateReferenceMaker from the GATK suite.

Required arguments

  • file.vcf.gz: Input VCF file

Options

  • -c, --chain FILE: Write a chain file for liftover
  • -a, --absent CHAR: Replace positions absent from VCF with CHAR
  • -e, --exclude EXPR: Exclude sites for which the EXPR is true
  • -f, --fasta-ref FILE: Reference sequence in fasta format. You can also specify the reference sequence via pipe, in this case, you don't need to specify the -f option.
  • -H, --haplotype WHICH: Choose which allele to use from the FORMAT/GT field, note the codes are case-insensitive:
    • N: N={1,2,3,...} is the index of the allele from GT, regardless of phasing (e.g. "2", the second allele, regardless of phasing)
    • R: REF allele in heterozygous genotypes
    • A: ALT allele in heterozygous genotypes
    • I: IUPAC code for all genotypes (automatically triggers -i)
    • LR, LA: the longer allele. If both have the same length, use the REF allele (LR), or the ALT allele (LA)
    • SR, SA: the shorter allele. If both have the same length, use the REF allele (SR), or the ALT allele (SA)
    • 1pIu, 2pIu: first/second allele for phased genotypes and IUPAC code for unphased genotypes. This option requires -s, unless exactly one sample is present in the VCF
  • -i, --include EXPR: Select sites for which the expression is true (see man page for details)
  • -I, --iupac-codes: output variants in the form of IUPAC ambiguity codes determined from FORMAT/GT fields. By default all samples are used and can be subset with -s, --samples and -S, --samples-file. Use -s - to ignore samples and use only the REF and ALT columns. NOTE: prior to version 1.17 the IUPAC codes were determined solely from REF,ALT columns and sample genotypes were not considered.
  • --mark-del CHAR: Instead of removing sequence, insert character CHAR for deletions
  • --mark-ins uc|lc|CHAR: Highlight insertions in uppercase (uc), lowercase (lc), or use CHAR, leaving the rest as is
  • --mark-snv uc|lc|CHAR: Highlight substitutions in uppercase (uc), lowercase (lc), or use CHAR, leaving the rest as is
  • -m, --mask FILE: Replace regions according to the next --mask-with option. The default is --mask-with N
  • --mask-with CHAR|uc|lc: Replace with CHAR (skips overlapping variants); change to uppercase (uc) or lowercase (lc)
  • -M, --missing CHAR: Output CHAR instead of skipping a missing genotype "./."
  • -o, --output FILE: Write output to a file. By default, output will be directed to standard output.
  • -p, --prefix STRING: Prefix to add to output sequence names
  • -s, --samples LIST: Comma-separated list of samples to include, "-" to ignore samples and use REF,ALT
  • -S, --samples-file FILE: File of samples to include

Examples

Apply variants present in sample "NA001"

The following example will apply variants present in sample "NA001", output IUPAC codes for heterogeneity

$ bcftools consensus -s NA001 -f in.fa in.vcf.gz > out.fa

Output IUPAC ambiguity codes based on REF+ALT columns (regardless of genotype)

$ cat reference.fa | bcftools consensus --iupac-codes calls.vcf.gz > consensus.fa

Output IUPAC ambiguity codes based on sample genotypes

$ cat reference.fa | bcftools consensus --haplotype I calls.vcf.gz > consensus.fa
Get the consensus for one region
$ samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa

Appendix: Valid expressions

Valid expressions may contain
  • numerical constants (e.g. 1, 1.0, 1e-4), string constants (e.g. "String"), file names (e.g. @file_name, this is currently supported only to filter by the ID column)
  • arithmetic operators (addition +, multiplication *, subtraction -, division /, modulo %)
  • comparison operators: == (same as =), >, >=, <=, <, !=
  • regex operators "\~" and its negation "!~". The expressions are case sensitive unless "/i" is added.
  • parentheses: (, )
  • logical operators. &&, &|||.
  • INFO tags, FORMAT tags, column names: INFO/DP or DP FORMAT/DV, FMT/DV, or DV FILTER, QUAL, ID, CHROM, POS, REF, ALT[0]
  • the FILTER column can be queried as follows:
    FILTER="PASS"
    FILTER="."
    FILTER="A"          .. exact match, for example "A;B" does not pass
    FILTER="A;B"        .. exact match, "A;B" and "B;A" pass, everything else fails
    FILTER!="A"         .. exact match, for example "A;B" does pass
    FILTER~"A"          .. subset match, for example both "A" and "A;B" pass
    FILTER~"A;B"        .. subset match, pass only if both "A" and "B" are present
    FILTER!~"A"         .. complement match, for example both "A" and "A;B" fail
    FILTER!~"A;B"       .. complement match, fail if both "A" and "B" are present
  • 1 (or 0) to test the presence (or absence) of a flag

    FlagA=1 && FlagB=0
  • "." to test missing values

    DP=".", DP!=".", ALT="."
  • missing genotypes can be matched regardless of phase and ploidy (".|.", "./.", ".", "0|.") using these expressions

    GT="mis", GT~"\.", GT!~"\."
  • missing genotypes can be matched including the phase and ploidy (".|.", "./.", ".") using these expressions

    GT=".|.", GT="./.", GT="."
  • sample genotype: reference (haploid or diploid), alternate (hom or het, haploid or diploid), missing genotype, homozygous, heterozygous, haploid, ref-ref hom, alt-alt hom, ref-alt het, alt-alt het, haploid ref, haploid alt (case-insensitive)

    GT="ref"
    GT="alt"
    GT="mis"
    GT="hom"
    GT="het"
    GT="hap"
    GT="RR"
    GT="AA"
    GT="RA" or GT="AR"
    GT="Aa" or GT="aA"
    GT="R"
    GT="A"
  • TYPE for variant type in REF,ALT columns (indel,snp,mnp,ref,bnd,other,overlap). Use the regex operator "\~" to require at least one allele of the given type or the equal sign "=" to require that all alleles are of the given type. Compare

    TYPE="snp"
    TYPE~"snp"
    TYPE!="snp"
    TYPE!~"snp"
  • array subscripts (0-based), "*" for any element, "-" to indicate a range. Note that for querying FORMAT vectors, the colon ":" can be used to select a sample and an element of the vector, as shown in the examples below.

    INFO/AF[0] > 0.3             .. first AF value bigger than 0.3
    FORMAT/AD[0:0] > 30          .. first AD value of the first sample bigger than 30
    FORMAT/AD[0:1]               .. first sample, second AD value
    FORMAT/AD[1:0]               .. second sample, first AD value
    DP4[*] == 0                  .. any DP4 value
    FORMAT/DP[0]   > 30          .. DP of the first sample bigger than 30
    FORMAT/DP[1-3] > 10          .. samples 2-4
    FORMAT/DP[1-]  < 7           .. all samples but the first
    FORMAT/DP[0,2-4] > 20        .. samples 1, 3-5
    FORMAT/AD[0:1]               .. first sample, second AD field
    FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field
    FORMAT/AD[*:1] or AD[:1]        .. any sample, second AD field
    (DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
    CSQ[*] ~ "missense_variant.*deleterious"
  • in addition to array subscripts shown above, it is possible to subscript arrays of Number=R tags by alleles found in FORMAT/GT (starting with version 1.17). For example

    FORMAT/AD[GT] > 10        .. require support of more than 10 reads for each allele
    FORMAT/AD[0:GT] > 10      .. same as above, but in the first sample
    sSUM(FORMAT/AD[GT]) > 20  .. require total sample depth bigger than 20
  • with many samples it can be more practical to provide a file with sample names, one sample name per line

    GT[@samples.txt]="het" & binom(AD)<0.01
  • function on FORMAT tags (over samples) and INFO tags (over vector fields): maximum; minimum; arithmetic mean (AVG is synonymous with MEAN); median; standard deviation from mean; sum; string length; absolute value; number of elements:

    MAX, MIN, AVG, MEAN, MEDIAN, STDEV, SUM, STRLEN, ABS, COUNT

    Note that functions above evaluate to a single value across all samples and are intended to select sites, not samples, even when applied on FORMAT tags. However, when prefixed with SMPL_ (or "s" for brevity, e.g. SMPL_MAX or sMAX), they will evaluate to a vector of per-sample values when applied on FORMAT tags:

    SMPL_MAX, SMPL_MIN, SMPL_AVG, SMPL_MEAN, SMPL_MEDIAN, SMPL_STDEV, SMPL_SUM,
    sMAX, sMIN, sAVG, sMEAN, sMEDIAN, sSTDEV, sSUM
  • two-tailed binomial test. Note that for N=0 the test evaluates to a missing value and when FORMAT/GT is used to determine the vector indices, it evaluates to 1 for homozygous genotypes.

    binom(FMT/AD)                .. GT can be used to determine the correct index
    binom(AD[0],AD[1])           .. or the fields can be given explicitly
    phred(binom())               .. the same as binom but phred-scaled
  • variables calculated on the fly if not present: number of alternate alleles; number of samples; count of alternate alleles; minor allele count (similar to AC but is always smaller than 0.5); frequency of alternate alleles (AF=AC/AN); frequency of minor alleles (MAF=MAC/AN); number of alleles in called genotypes; number of samples with missing genotype; fraction of samples with missing genotype; indel length (deletions negative, insertions positive)

    N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING, ILEN
  • the number (N_PASS) or fraction (F_PASS) of samples which pass the expression

    N_PASS(GQ>90 & GT!="mis") > 90
    F_PASS(GQ>90 & GT!="mis") > 0.9
  • custom perl filtering. Note that this command is not compiled in by default, see the section Optional Compilation with Perl in the INSTALL file for help and misc/demo-flt.pl for a working example. The demo defined the perl subroutine "severity" which can be invoked from the command line as follows:

    perl:path/to/script.pl; perl.severity(INFO/CSQ) > 3
Notes
  • String comparisons and regular expressions are case-insensitive

  • Comma in strings is interpreted as a separator and when multiple values are compared, the OR logic is used. Consequently, the following two expressions are equivalent but not the third:

    -i 'TAG="hello,world"'
    -i 'TAG="hello" || TAG="world"'
    -i 'TAG="hello" && TAG="world"'
  • Variables and function names are case-insensitive, but not tag names. For example, "qual" can be used instead of "QUAL", "strlen()" instead of "STRLEN()" , but not "dp" instead of "DP".

  • When querying multiple values, all elements are tested and the OR logic is used on the result. For example, when querying "TAG=1,2,3,4", it will be evaluated as follows:

    -i 'TAG[*]=1'   .. true, the record will be printed
    -i 'TAG[*]!=1'  .. true
    -e 'TAG[*]=1'   .. false, the record will be discarded
    -e 'TAG[*]!=1'  .. false
    -i 'TAG[0]=1'   .. true
    -i 'TAG[0]!=1'  .. false
    -e 'TAG[0]=1'   .. false
    -e 'TAG[0]!=1'  .. true
  • When arithmetic operators are used on vectors A and B, the following logic is used to compute the resulting vector C:

    • C_i = A_i + B_i when length(A)==B(A) and sets length©=length(A)

    • C_i = A_i + B_0 when length(B)=1 and sets length©=length(A)

    • C_i = A_0 + B_i when length(A)=1 and sets length©=length(B)

    • throw an error when length(A)!=length(B) AND length(A)!=1 AND length(B)!=1

 

File formats this tool works with
VCFFASTA

Share your experience or ask a question