Category

Genome Variant Analysis


Usage

bcftools concat [options] <A.vcf.gz> [<B.vcf.gz> [...]]


Manual

bcftools concat is a command in the BCFtools suite.

Concatenate or combine VCF/BCF files. All source files must have the same sample columns appearing in the same order. Can be used, for example, to concatenate chromosome VCFs into one VCF, or combine a SNP VCF and an indel VCF into one. The input files must be sorted by chromosome and position. Unless the -a, --allow-overlaps option is specified, the files must be given in the correct order to produce sorted VCF output. With the --naive option, the files are concatenated without being recompressed, which is very fast.
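
As a rough sketch of the SNP and indel case described above, the snippet below assembles the command and runs it only when bcftools and both inputs are present; otherwise it just prints it. The file names are hypothetical, and the snippet assumes both inputs are bgzip-compressed and indexed, which -a is expected to need.

```shell
#!/bin/sh
# Hypothetical file names; both inputs must list the same samples in the same order.
snps="sample.snps.vcf.gz"
indels="sample.indels.vcf.gz"
out="sample.combined.vcf.gz"

# -a lets records from the two files interleave, so the output stays sorted.
# Assumption: both inputs are bgzip-compressed and indexed.
cmd="bcftools concat -a -O z -o $out $snps $indels"

# Run only when bcftools and both inputs are available; otherwise show the command.
if command -v bcftools >/dev/null 2>&1 && [ -e "$snps" ] && [ -e "$indels" ]; then
  $cmd
else
  echo "$cmd"
fi
```

Without -a, the same two files would be written one after the other, and the result would generally not be sorted by position.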

Related tools: bcftools merge

Required arguments

  • A.vcf.gz: Input VCF/BCF file(s) to concatenate
  • [<B.vcf.gz> [...]]: Additional input VCF/BCF file(s) to concatenate (optional)

Options

  • -a, --allow-overlaps: First coordinate of the next file can precede the last record of the current file.
  • -c, --compact-PS: Do not output PS tag at each site, only at the start of a new phase set block.
  • -d, --rm-dups STRING: Output duplicate records of the specified type that are present in multiple files only once. Note that records duplicated within one file are not removed with this option; for that, use bcftools norm -d instead.
    In other words, the default behavior of the program is similar to unix cat in that when two files contain a record with the same position, that position will appear twice on output. With -d, every line that finds a matching record in another file will be printed only once. Requires -a, --allow-overlaps.
    • snps: For SNPs only
    • indels: For indels only
    • both: For both SNPs and indels
    • all: For all duplicates
    • exact: For exact duplicates
  • -D, --remove-duplicates: Alias for -d exact
  • -f, --file-list FILE: Read the list of files from a file, one file name per line.
  • -G, --drop-genotypes: Drop individual genotype information.
  • -l, --ligate: Ligate phased VCFs by matching phase at overlapping haplotypes. Note that the option is intended for VCFs with perfect overlap; sites in overlapping regions that are present in one file but missing in the other are dropped.
  • --ligate-force: Keep all sites and ligate even non-overlapping chunks and chunks with imperfect overlap.
  • --ligate-warn: Drop sites in imperfect overlaps
  • --no-version: Do not append version and command line to the header
  • -n, --naive: Concatenate VCF or BCF files without recompression. This is very fast but requires that all files are of the same type (all VCF or all BCF) and have the same headers. This is because all tags and chromosome names in the BCF body rely on the order of the contig and tag definitions in the header. A header compatibility check is performed and the program throws an error if it is not safe to use the option.
  • --naive-force: Same as --naive, but header compatibility is not checked. Dangerous, use with caution.
  • -o, --output FILE: Write output to a file. By default, output will be directed to the standard output.
  • -O, --output-type u|b|v|z[0-9]: Output compressed BCF (b), uncompressed BCF (u), compressed VCF (z), uncompressed VCF (v). Use the -O u option when piping between bcftools subcommands to speed up performance by removing unnecessary compression/decompression and VCF/BCF conversion. The compression level of the compressed formats (b and z) can be set by appending a number between 0 and 9.
  • -q, --min-PQ INT: Break phase set if phasing quality is lower than INT [30]
  • -r, --regions REGION: Restrict to a comma-separated list of regions. Requires -a, --allow-overlaps.
  • -R, --regions-file FILE: Restrict to regions listed in a file. Requires -a, --allow-overlaps.
  • --regions-overlap 0|1|2: Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
  • --threads INT: Use multithreading with INT worker threads [0]
  • -v, --verbose 0|1: Set verbosity level [1]
  • --write-index: Automatically index the output files
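
To illustrate how -O u and duplicate removal can fit together, here is a minimal sketch that streams uncompressed BCF from bcftools concat into bcftools norm -d exact. The input and output names are hypothetical, the inputs are assumed to be indexed for -a, and the pipeline is skipped when bcftools or the inputs are missing. Because norm sees the combined stream, it would be expected to remove exact duplicates regardless of which input file they came from.

```shell
#!/bin/sh
# Hypothetical inputs with overlapping coordinates; assumed to be indexed for -a.
first="batch1.vcf.gz"
second="batch2.vcf.gz"
out="merged.dedup.vcf.gz"

if command -v bcftools >/dev/null 2>&1 && [ -e "$first" ] && [ -e "$second" ]; then
  # -O u keeps the intermediate stream as uncompressed BCF, so the second
  # command does not need to decompress or parse VCF text.
  bcftools concat -a -O u "$first" "$second" |
    bcftools norm -d exact -O z -o "$out" -
else
  echo "skipping: bcftools or the input files are not available" >&2
fi
```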

Examples

Combine VCFs for all chromosomes into a single file

In the following example, we will combine all the autosomal chromosome VCFs (chromosomes 1 to 22) from the 1000 Genomes Project into a single VCF file:

bcftools concat ALL.chr{1..22}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
  -O z -o ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
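
The brace expansion above relies on a shell such as bash. As a possible alternative, the sketch below writes the same 22 file names to a list in chromosome order and passes it with -f. The list file name autosomes.list is a hypothetical choice, and the concat step is only run when bcftools and the first input are present.

```shell
#!/bin/sh
# Prefix and suffix are copied from the example above.
prefix="ALL.chr"
suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
list="autosomes.list"   # hypothetical name for the list file

# Write one file name per line, chromosomes 1 to 22 in numeric order.
: > "$list"
chr=1
while [ "$chr" -le 22 ]; do
  echo "${prefix}${chr}${suffix}" >> "$list"
  chr=$((chr + 1))
done

out="ALL.autosomes${suffix}"
if command -v bcftools >/dev/null 2>&1 && [ -e "$(head -n 1 "$list")" ]; then
  bcftools concat -f "$list" -O z -o "$out"
else
  echo "would run: bcftools concat -f $list -O z -o $out"
fi
```

Generating the list explicitly avoids the ordering problem of a plain glob such as ALL.chr*, which sorts chr10 before chr2 and would produce unsorted output.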

File formats this tool works with
VCF, BCF
