Reference Code backup Executable files
Generate a consensus sequence by applying variants from a VCF file to a reference genome, producing a personalized genomic sequence for a specific individual based on their genetic variants.
bcftools consensus [OPTIONS] <file.vcf.gz>
bcftools consensus
is a command in the BCFtools suite.
Sometimes there is the need to create a consensus sequence for an individual where the sequence incorporates variants typed for this individual (via the --samples option). This is possible using the consensus
command.
We need the reference sequence reference.fa in the fasta format and an indexed VCF with the variants calls.bcf. The command is:
$ cat reference.fa | bcftools consensus calls.vcf.gz > consensus.fa
In order to keep the program relatively simple, consensus
merely applies variants from the VCF regardless of AF or other properties, and symbolic alleles other than <DEL>, <*> or <NON_REF> are currently not supported, e.g. <INS>. If more more advanced logic is required, the VCF has to be preprocessed with other tools like FastaAlternateReferenceMaker
from the GATK suite.
-s -
to ignore samples and use only the REF and ALT columns. NOTE: prior to version 1.17 the IUPAC codes were determined solely from REF,ALT columns and sample genotypes were not considered.--mask-with N
The following example will apply variants present in sample "NA001", output IUPAC codes for heterogeneity
$ bcftools consensus-s NA001
-f in.fa
in.vcf.gz > out.fa
Output IUPAC ambiguity codes based on REF+ALT columns (regardless of genotype)
$ cat reference.fa | bcftools consensus --iupac-codes calls.vcf.gz > consensus.fa
Output IUPAC ambiguity codes based on sample genotypes
$ cat reference.fa | bcftools consensus --haplotype I
calls.vcf.gz > consensus.fa
$ samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa
FILTER="PASS" FILTER="." FILTER="A" .. exact match, for example "A;B" does not pass FILTER="A;B" .. exact match, "A;B" and "B;A" pass, everything else fails FILTER!="A" .. exact match, for example "A;B" does pass FILTER~"A" .. subset match, for example both "A" and "A;B" pass FILTER~"A;B" .. subset match, pass only if both "A" and "B" are present FILTER!~"A" .. complement match, for example both "A" and "A;B" fail FILTER!~"A;B" .. complement match, fail if both "A" and "B" are present
1 (or 0) to test the presence (or absence) of a flag
FlagA=1 && FlagB=0
"." to test missing values
DP=".", DP!=".", ALT="."
missing genotypes can be matched regardless of phase and ploidy (".|.", "./.", ".", "0|.") using these expressions
GT="mis", GT~"\.", GT!~"\."
missing genotypes can be matched including the phase and ploidy (".|.", "./.", ".") using these expressions
GT=".|.", GT="./.", GT="."
sample genotype: reference (haploid or diploid), alternate (hom or het, haploid or diploid), missing genotype, homozygous, heterozygous, haploid, ref-ref hom, alt-alt hom, ref-alt het, alt-alt het, haploid ref, haploid alt (case-insensitive)
GT="ref" GT="alt" GT="mis" GT="hom" GT="het" GT="hap" GT="RR" GT="AA" GT="RA" or GT="AR" GT="Aa" or GT="aA" GT="R" GT="A"
TYPE for variant type in REF,ALT columns (indel,snp,mnp,ref,bnd,other,overlap). Use the regex operator "\~" to require at least one allele of the given type or the equal sign "=" to require that all alleles are of the given type. Compare
TYPE="snp" TYPE~"snp" TYPE!="snp" TYPE!~"snp"
array subscripts (0-based), "*" for any element, "-" to indicate a range. Note that for querying FORMAT vectors, the colon ":" can be used to select a sample and an element of the vector, as shown in the examples below.
INFO/AF[0] > 0.3 .. first AF value bigger than 0.3 FORMAT/AD[0:0] > 30 .. first AD value of the first sample bigger than 30 FORMAT/AD[0:1] .. first sample, second AD value FORMAT/AD[1:0] .. second sample, first AD value DP4[*] == 0 .. any DP4 value FORMAT/DP[0] > 30 .. DP of the first sample bigger than 30 FORMAT/DP[1-3] > 10 .. samples 2-4 FORMAT/DP[1-] < 7 .. all samples but the first FORMAT/DP[0,2-4] > 20 .. samples 1, 3-5 FORMAT/AD[0:1] .. first sample, second AD field FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field FORMAT/AD[*:1] or AD[:1] .. any sample, second AD field (DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3 CSQ[*] ~ "missense_variant.*deleterious"
in addition to array subscripts shown above, it is possible to subscript arrays of Number=R tags by alleles found in FORMAT/GT (starting with version 1.17). For example
FORMAT/AD[GT] > 10 .. require support of more than 10 reads for each allele FORMAT/AD[0:GT] > 10 .. same as above, but in the first sample sSUM(FORMAT/AD[GT]) > 20 .. require total sample depth bigger than 20
with many samples it can be more practical to provide a file with sample names, one sample name per line
GT[@samples.txt]="het" & binom(AD)<0.01
function on FORMAT tags (over samples) and INFO tags (over vector fields): maximum; minimum; arithmetic mean (AVG is synonymous with MEAN); median; standard deviation from mean; sum; string length; absolute value; number of elements:
MAX, MIN, AVG, MEAN, MEDIAN, STDEV, SUM, STRLEN, ABS, COUNT
Note that functions above evaluate to a single value across all samples and are intended to select sites, not samples, even when applied on FORMAT tags. However, when prefixed with SMPL_ (or "s" for brevity, e.g. SMPL_MAX or sMAX), they will evaluate to a vector of per-sample values when applied on FORMAT tags:
SMPL_MAX, SMPL_MIN, SMPL_AVG, SMPL_MEAN, SMPL_MEDIAN, SMPL_STDEV, SMPL_SUM, sMAX, sMIN, sAVG, sMEAN, sMEDIAN, sSTDEV, sSUM
two-tailed binomial test. Note that for N=0 the test evaluates to a missing value and when FORMAT/GT is used to determine the vector indices, it evaluates to 1 for homozygous genotypes.
binom(FMT/AD) .. GT can be used to determine the correct index binom(AD[0],AD[1]) .. or the fields can be given explicitly phred(binom()) .. the same as binom but phred-scaled
variables calculated on the fly if not present: number of alternate alleles; number of samples; count of alternate alleles; minor allele count (similar to AC but is always smaller than 0.5); frequency of alternate alleles (AF=AC/AN); frequency of minor alleles (MAF=MAC/AN); number of alleles in called genotypes; number of samples with missing genotype; fraction of samples with missing genotype; indel length (deletions negative, insertions positive)
N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING, ILEN
the number (N_PASS) or fraction (F_PASS) of samples which pass the expression
N_PASS(GQ>90 & GT!="mis") > 90 F_PASS(GQ>90 & GT!="mis") > 0.9
custom perl filtering. Note that this command is not compiled in by default, see the section Optional Compilation with Perl in the INSTALL file for help and misc/demo-flt.pl for a working example. The demo defined the perl subroutine "severity" which can be invoked from the command line as follows:
perl:path/to/script.pl; perl.severity(INFO/CSQ) > 3
String comparisons and regular expressions are case-insensitive
Comma in strings is interpreted as a separator and when multiple values are compared, the OR logic is used. Consequently, the following two expressions are equivalent but not the third:
-i 'TAG="hello,world"' -i 'TAG="hello" || TAG="world"' -i 'TAG="hello" && TAG="world"'
Variables and function names are case-insensitive, but not tag names. For example, "qual" can be used instead of "QUAL", "strlen()" instead of "STRLEN()" , but not "dp" instead of "DP".
When querying multiple values, all elements are tested and the OR logic is used on the result. For example, when querying "TAG=1,2,3,4", it will be evaluated as follows:
-i 'TAG[*]=1' .. true, the record will be printed -i 'TAG[*]!=1' .. true -e 'TAG[*]=1' .. false, the record will be discarded -e 'TAG[*]!=1' .. false -i 'TAG[0]=1' .. true -i 'TAG[0]!=1' .. false -e 'TAG[0]=1' .. false -e 'TAG[0]!=1' .. true
When arithmetic operators are used on vectors A and B, the following logic is used to compute the resulting vector C:
C_i = A_i + B_i when length(A)==B(A) and sets length©=length(A)
C_i = A_i + B_0 when length(B)=1 and sets length©=length(A)
C_i = A_0 + B_i when length(A)=1 and sets length©=length(B)
throw an error when length(A)!=length(B) AND length(A)!=1 AND length(B)!=1