Category

Genomic Interval Manipulation


Usage

bedtools merge [OPTIONS] -i <BED/GFF/VCF/BAM>


Manual

This tool is part of the bedtools suite.

Required Arguments

  • -i bed/gff/vcf: The input file, which must be sorted by chrom, then start.

Options

  • -s: Force strandedness. That is, only merge features that are on the same strand. (By default, merging is done without respect to strand.)
  • -S str: Force merge for one specific strand only. Follow with + or - to force merge from only the forward or reverse strand, respectively. (By default, merging is done without respect to strand.)
  • -d INTEGER: Maximum distance allowed between features for them to be merged. Default: 0, meaning that only overlapping and book-ended features are merged. Note: negative values enforce the number of bp required for overlap.
  • -c: Specify columns from the input file to summarize for each merged interval (see -o, below). Default: 5. Multiple columns can be specified in a comma-delimited list.
  • -o <str>: Specify the operation that should be applied to -c. Default: sum. Valid operations are:
    • sum - numeric only
    • count - numeric or text
    • count_distinct - numeric or text
    • min - numeric only
    • max - numeric only
    • mean - numeric only
    • median - numeric only
    • mode - the most frequent value, numeric or text
    • antimode - the least frequent value, numeric or text
    • stdev - numeric only
    • sstdev - sample standard deviation, numeric only
    • collapse - print a comma-separated list, numeric or text (duplicates allowed)
    • distinct - print a comma-separated list (NO duplicates allowed), numeric or text
    • distinct_sort_num - as distinct, but sorted numerically, ascending
    • distinct_sort_num_desc - as distinct, but sorted numerically, descending
    • concat - print a comma-separated list, numeric or text
    • freqasc - print a comma-separated list of values observed and the number of times they were observed. Reported in ascending order of frequency.
    • freqdesc - print a comma-separated list of values observed and the number of times they were observed. Reported in descending order of frequency.
    • first - just the first value in the column
    • last - just the last value in the column

    Multiple operations can be specified in a comma-delimited list. If there is only one column, but multiple operations, all operations will be applied to that column. Likewise, if there is only one operation, but multiple columns, that operation will be applied to all columns. Otherwise, the number of columns must match the number of operations, and will be applied in respective order. E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5, the mean of column 4, and the count of column 6. The order of output columns will match the ordering given in the command.

  • -delim: Specify a custom delimiter for the collapse operations. Example: -delim "|". Default: ",".
  • -prec: Sets the decimal precision for output (Default: 5).
  • -bed: If using BAM input, write output as BED.
  • -header: Print the header from the input file prior to results.
  • -nobuf: Disable buffered output. Using this option will cause each line of output to be printed as it is generated, rather than saved in a buffer. This will make printing large output files noticeably slower, but can be useful in conjunction with other software tools and scripts that need to process one line of bedtools output at a time.
  • -iobuf: Specify the amount of memory to use for input buffer. Takes an integer argument. Optional suffixes K/M/G supported. Note: currently has no effect with compressed files.
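
The rule for pairing -c columns with -o operations, described under -o above, can be sketched in Python. This is only an illustration of the documented rule, not bedtools' own code; the function name pair_columns_and_operations is hypothetical.

```python
def pair_columns_and_operations(columns, operations):
    """Pair -c columns with -o operations following the documented rule.

    Hypothetical helper for illustration; both arguments are
    comma-delimited strings such as "5,4,6" and "sum,mean,count".
    """
    cols = [int(c) for c in columns.split(",")]
    ops = operations.split(",")
    if len(cols) == 1:
        # One column, many operations: every operation uses that column.
        cols = cols * len(ops)
    elif len(ops) == 1:
        # One operation, many columns: the operation is applied to each column.
        ops = ops * len(cols)
    elif len(cols) != len(ops):
        raise ValueError("number of columns must match number of operations")
    # Output columns follow the order given on the command line.
    return list(zip(cols, ops))


print(pair_columns_and_operations("5,4,6", "sum,mean,count"))
print(pair_columns_and_operations("5", "mean,min,max"))
```

The first call pairs column 5 with sum, column 4 with mean, and column 6 with count; the second applies all three operations to column 5.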

Examples

Default behavior

By default, merge combines overlapping (by at least 1 bp) and/or bookended intervals into a single, “flattened” or “merged” interval.

$ cat A.bed
chr1  100  200
chr1  180  250
chr1  250  500
chr1  501  1000

$ bedtools merge -i A.bed
chr1  100  500
chr1  501  1000
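
The default behavior above can be sketched in a few lines of Python. This is a minimal illustration of merging sorted, half-open intervals, not the bedtools implementation; the name merge_intervals is hypothetical, and the input is assumed to be sorted by chrom, then start.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) tuples.

    Illustrative sketch only; assumes the input is already sorted
    by chrom, then start, as -i requires.
    """
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            # Different chromosome or a gap: start a new merged interval.
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


a_bed = [
    ("chr1", 100, 200),
    ("chr1", 180, 250),
    ("chr1", 250, 500),
    ("chr1", 501, 1000),
]
for row in merge_intervals(a_bed):
    print(*row, sep="\t")
```

The interval starting at 501 stays separate because a 1 bp gap separates it from the interval ending at 500.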

Enforcing strandedness

The -s option will only merge intervals that are overlapping/bookended and are on the same strand.

$ cat A.bed
chr1  100  200   a1  1 +
chr1  180  250   a2  2 +
chr1  250  500   a3  3 -
chr1  501  1000  a4  4 +

$ bedtools merge -i A.bed -s
chr1  100  250
chr1  501  1000
chr1  250  500

To also report the strand, you could use the -c and -o operators (see below for more details):

$ bedtools merge -i A.bed -s -c 6 -o distinct
chr1  100 250 +
chr1  501 1000  +
chr1  250 500 -

Reporting merged intervals on a specific strand

The -S option will only merge intervals for a specific strand. For example, to only report merged intervals on the + strand:

$ cat A.bed
chr1  100  200   a1  1 +
chr1  180  250   a2  2 +
chr1  250  500   a3  3 -
chr1  501  1000  a4  4 +

$ bedtools merge -i A.bed -S +
chr1  100 250
chr1  501 1000

To also report the strand, you could use the -c and -o operators (see below for more details):

$ bedtools merge -i A.bed -S + -c 6 -o distinct
chr1  100 250 +
chr1  501 1000  +
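
The two strand options can be illustrated with one Python sketch. It is an assumption-laden model rather than bedtools' code: the name merge_by_strand and its only_strand parameter are hypothetical, and the order of output rows here (grouped by strand) may differ from what a given bedtools version prints.

```python
def merge_by_strand(records, only_strand=None):
    """Merge (chrom, start, end, strand) records without mixing strands.

    Illustrative sketch. only_strand=None models -s (each strand is
    merged separately); only_strand="+" or "-" models -S, where the
    other strand is ignored. Input must be sorted by chrom, then start.
    """
    groups = {}
    for chrom, start, end, strand in records:
        if only_strand is None or strand == only_strand:
            groups.setdefault(strand, []).append((chrom, start, end))

    merged = []
    for strand, intervals in groups.items():
        current = None
        for chrom, start, end in intervals:
            if current and current[0] == chrom and start <= current[2]:
                # Same strand and overlapping or book-ended: extend.
                current[2] = max(current[2], end)
            else:
                current = [chrom, start, end, strand]
                merged.append(current)
    return [tuple(m) for m in merged]


a_bed = [
    ("chr1", 100, 200, "+"),
    ("chr1", 180, 250, "+"),
    ("chr1", 250, 500, "-"),
    ("chr1", 501, 1000, "+"),
]
print(merge_by_strand(a_bed))                   # models -s
print(merge_by_strand(a_bed, only_strand="+"))  # models -S +
```

Note that the interval at 250-500 is not merged with 100-250 even though the two are book-ended, because they lie on opposite strands.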

Controlling how close two features must be in order to merge

By default, only overlapping or book-ended features are combined into a new feature. However, the -d option forces merge to combine more distant features. For example, with -d 1000, any features that overlap or lie within 1000 base pairs of one another are combined.

$ cat A.bed
chr1  100  200
chr1  501  1000

$ bedtools merge -i A.bed
chr1  100  200
chr1  501  1000

$ bedtools merge -i A.bed -d 1000
chr1  100  1000
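
The effect of -d can be modeled by comparing the gap between consecutive features with a threshold. The Python sketch below is an illustration under that assumption; merge_within_distance and max_distance are hypothetical names, and the exact boundary handling inside bedtools has not been verified against this model.

```python
def merge_within_distance(intervals, max_distance=0):
    """Merge sorted (chrom, start, end) tuples separated by <= max_distance bp.

    Illustrative sketch. A negative max_distance demands an overlap of at
    least that many bp, mirroring the note on the -d option.
    """
    merged = []
    for chrom, start, end in intervals:
        previous = merged[-1] if merged else None
        # gap is 0 for book-ended features and negative for overlapping ones.
        if previous and previous[0] == chrom and start - previous[2] <= max_distance:
            previous[2] = max(previous[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


a_bed = [("chr1", 100, 200), ("chr1", 501, 1000)]
print(merge_within_distance(a_bed))                     # gap of 301 bp: kept apart
print(merge_within_distance(a_bed, max_distance=1000))  # merged into 100-1000
```

With the default of 0 the two features stay separate, and with a threshold of 1000 they collapse into a single interval from 100 to 1000.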

Applying operations to columns from merged intervals

When merging intervals, we often want to summarize or keep track of the values observed in specific columns (e.g., the feature name or score) from the original, unmerged intervals. When used together, the -c and -o options allow one to select specific columns (-c) and apply an operation (-o) to each column. The result is appended to the default merged-interval output. For example, the following reports the number of intervals that were merged into each resulting interval (this replaces the -n option that existed prior to version 2.20.0).

$ cat A.bed
chr1  100  200
chr1  180  250
chr1  250  500
chr1  501  1000

$ bedtools merge -i A.bed -c 1 -o count
chr1  100  500  3
chr1  501  1000 1

We could also use these options to report the mean of the score (#5) field:

$ cat A.bed
chr1  100  200   a1  1 +
chr1  180  250   a2  2 +
chr1  250  500   a3  3 -
chr1  501  1000  a4  4 +

$ bedtools merge -i A.bed -c 5 -o mean
chr1  100 500 2
chr1  501 1000  4

Let’s get fancy and report the mean, min, and max of the score column:

$ bedtools merge -i A.bed -c 5 -o mean,min,max
chr1  100 500 2 1 3
chr1  501 1000  4 4 4

Let’s also report a comma-separated list of the strands:

$ bedtools merge -i A.bed -c 5,5,5,6 -o mean,min,max,collapse
chr1  100 500 2 1 3 +,+,-
chr1  501 1000  4 4 4 +
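
The column summaries above can be modeled by remembering which input rows fed each merged interval and then applying each operation to the chosen column. The Python sketch below covers only a handful of operations and is not bedtools' implementation; merge_and_summarize and OPERATIONS are hypothetical names, and the number formatting is a simplification that ignores -prec.

```python
from statistics import mean

# A small, illustrative subset of the -o operations.
OPERATIONS = {
    "count": lambda values: str(len(values)),
    "mean": lambda values: format(mean([float(v) for v in values]), "g"),
    "min": lambda values: format(min(float(v) for v in values), "g"),
    "max": lambda values: format(max(float(v) for v in values), "g"),
    "collapse": lambda values: ",".join(values),
}


def merge_and_summarize(rows, columns, operations):
    """Merge sorted BED rows and summarize 1-based columns per merged interval.

    rows are lists of string fields; columns and operations are
    equal-length lists paired in order.
    """
    groups = []  # each entry: [chrom, start, end, member_rows]
    for row in rows:
        chrom, start, end = row[0], int(row[1]), int(row[2])
        if groups and groups[-1][0] == chrom and start <= groups[-1][2]:
            groups[-1][2] = max(groups[-1][2], end)
            groups[-1][3].append(row)
        else:
            groups.append([chrom, start, end, [row]])

    output = []
    for chrom, start, end, members in groups:
        fields = [chrom, str(start), str(end)]
        for column, operation in zip(columns, operations):
            values = [member[column - 1] for member in members]
            fields.append(OPERATIONS[operation](values))
        output.append(fields)
    return output


a_bed = [
    ["chr1", "100", "200", "a1", "1", "+"],
    ["chr1", "180", "250", "a2", "2", "+"],
    ["chr1", "250", "500", "a3", "3", "-"],
    ["chr1", "501", "1000", "a4", "4", "+"],
]
for fields in merge_and_summarize(a_bed, [5, 5, 5, 6], ["mean", "min", "max", "collapse"]):
    print("\t".join(fields))
```

Run on the six-column A.bed, the sketch yields the same two rows as the last bedtools command above.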

Hopefully this provides a clear picture of what can be done.

File formats this tool works with
BED, GFF, GTF, VCF
