Category

Genomic Interval Manipulation


Usage

bedtools fisher [OPTIONS] -a <bed/gff/vcf/bam> -b <bed/gff/vcf/bam> -g <genome file>


Manual

This tool is part of the bedtools suite, and it's also known as fisher.

How it works

This implementation first counts the overlaps between the two files and the intervals unique to each file, and it infers (or accepts) the number of intervals that are present in neither file. It then constructs a 2x2 contingency table and performs Fisher's exact test. In the table and in the tool's output, "query" refers to the -a file and "db" refers to the -b file.

            | in -b                                                     | not in -b
  in -a     | Number of overlaps (denoted n11)                          | Number of query intervals - Number of overlaps (denoted n12)
  not in -a | Number of db intervals - Number of overlaps (denoted n21) | Number of possible intervals - n11 - n12 - n21
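
The table can be filled in from four counts. The following is a minimal Python sketch, not the tool's own code: the function names are hypothetical, the input counts are the ones reported in the Example section of this page, and reading the output's "ratio" column as the odds ratio n11*n22 / (n12*n21) is an assumption (it does reproduce the reported value).

```python
def contingency_table(n_query, n_db, n_overlaps, n_possible):
    """Return (n11, n12, n21, n22) as laid out in the table above."""
    n11 = n_overlaps                    # in -a and in -b
    n12 = n_query - n_overlaps          # in -a, not in -b
    n21 = n_db - n_overlaps             # in -b, not in -a
    n22 = n_possible - n11 - n12 - n21  # in neither file
    return n11, n12, n21, n22


def odds_ratio(n11, n12, n21, n22):
    """Odds ratio of a 2x2 table (assumed meaning of the 'ratio' column)."""
    return (n11 * n22) / (n12 * n21)


# Counts taken from the Example section of this page.
table = contingency_table(926535, 714888, 622725, 9061211)
print(table)
print(round(odds_ratio(*table), 3))
```

The four cells always sum to the number of possible intervals, so the estimate of that number directly controls the "not in -a, not in -b" cell.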

The total number of possible intervals is estimated with a heuristic that uses the mean interval sizes in the a and b sets and the size of the genome. For example, if the mean interval sizes in a and b are 100 bp and 150 bp, respectively, and the genome is 5,000 bp long, then this implementation estimates around 20 possible intervals in total (5000 / (100 + 150) = 20). Before using this tool, consider carefully whether this heuristic fits your data and assumptions.
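
The arithmetic of the example can be sketched in Python as follows. This assumes the heuristic is the genome size divided by the sum of the two mean interval sizes, which is consistent with the numbers above; the actual implementation may differ in detail (rounding, lower bounds), and the function name is hypothetical.

```python
def estimate_possible_intervals(genome_size, mean_len_a, mean_len_b):
    """Rough estimate of how many intervals could fit in the genome.

    Assumption: genome size divided by the sum of the mean interval
    sizes of the two sets, truncated to an integer.
    """
    return int(genome_size / (mean_len_a + mean_len_b))


print(estimate_possible_intervals(5000, 100, 150))  # prints 20
```

Because this estimate sets the size of the "in neither file" cell, a few very long or very short intervals in either file can shift the resulting p-values noticeably.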

Required arguments

  • -a <bed/gff/vcf/bam>: BED/GFF/VCF/BAM file A. Each feature in A is compared to B in search of overlaps. Use stdin if passing A with a UNIX pipe.
  • -b <bed/gff/vcf/bam>: BED/GFF/VCF/BAM file B. Use stdin if passing B with a UNIX pipe.
  • -g <genome>: Genome file listing chromosome sizes. This can be retrieved with tools like fetchChromSizes.

Notes

  1. The regions in both -a and -b must be pre-sorted by chromosome and then by start position (e.g., sort -k1,1 -k2,2n in.bed > in.sorted.bed, or bedtools sort for BED files).
  2. If you use bam files as input for -a or -b, remember to turn on the -bed option.
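
The required ordering can also be produced in a script. The sketch below is a hypothetical Python helper that mirrors sort -k1,1 -k2,2n for tab-separated BED lines. It assumes there are no header, track, or comment lines, and it compares chromosome names as plain strings, which can differ from sort under some locales.

```python
def sort_bed_lines(lines):
    """Sort BED lines by chromosome (as text), then by numeric start."""
    def key(line):
        fields = line.rstrip("\n").split("\t")
        return (fields[0], int(fields[1]))

    # Skip blank lines; every remaining line must have at least 2 columns.
    return sorted((line for line in lines if line.strip()), key=key)


unsorted = ["chr2\t5\t10\n", "chr1\t100\t200\n", "chr1\t20\t30\n"]
for line in sort_bed_lines(unsorted):
    print(line, end="")
```

Both input files should be sorted the same way, so that chromosomes appear in the same order in -a and -b.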

Options

  • -m: Merge overlapping intervals before looking at overlap.
  • -f <float>: Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp).
  • -F <float>: Minimum overlap required as a fraction of B. Default is 1E-9 (i.e. 1bp).
  • -e: Require that the minimum fraction be satisfied for A OR B. In other words, if -e is used with -f 0.90 and -F 0.10, this requires that either 90% of A is covered OR 10% of B is covered. Without -e, both fractions would have to be satisfied.
  • -r: Require that the fraction of overlap be reciprocal for A and B. In other words, if -f 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
  • -s: Require same strandedness. That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand.
  • -S: Require different strandedness. That is, only report hits in B that overlap A on the opposite strand. By default, overlaps are reported without respect to strand.
  • -nonamecheck: For sorted data, don't throw an error if the files use different naming conventions for the same chromosome (e.g., "chr1" vs. "chr01").
  • -bed: If using BAM input, write output as BED.
  • -split: Treat split BAM (i.e., having an “N” CIGAR operation) or BED12 entries as distinct BED intervals.
  • -header: Print the header from the A file prior to results.
  • -nobuf: Disable buffered output. Using this option will cause each line of output to be printed as it is generated, rather than saved in a buffer. This will make printing large output files noticeably slower, but can be useful in conjunction with other software tools and scripts that need to process one line of bedtools output at a time.
  • -iobuf <int>: Specify amount of memory to use for input buffer. Takes an integer argument. Optional suffixes K/M/G supported. Note: currently has no effect with compressed files.
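
How -f, -F, -e, and -r combine can be illustrated with a small predicate. This is a sketch of the semantics described in the option list, not code from bedtools. The function and parameter names are hypothetical, intervals are assumed to be half-open (start, end) pairs of positive length on the same chromosome, and strand options are ignored.

```python
def overlap_passes(a, b, f=1e-9, F=1e-9, reciprocal=False, either=False):
    """Return True if intervals a and b count as overlapping.

    f / F      minimum overlap as a fraction of A / B (like -f / -F)
    reciprocal apply the A fraction to B as well (like -r)
    either     accept if A OR B meets its fraction (like -e)
    """
    a_start, a_end = a
    b_start, b_end = b
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return False  # no shared bases at all
    if reciprocal:
        F = f
    ok_a = overlap / (a_end - a_start) >= f
    ok_b = overlap / (b_end - b_start) >= F
    return (ok_a or ok_b) if either else (ok_a and ok_b)


# B covers 10% of A, while A covers all of B.
print(overlap_passes((0, 100), (50, 60), f=0.9, F=0.1))               # False
print(overlap_passes((0, 100), (50, 60), f=0.9, F=0.1, either=True))  # True
```

With the defaults, any shared base counts as an overlap, which matches the "1bp" wording in the -f and -F descriptions.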

Example

$ bedtools fisher -a gcp_chr22.bam -b chr22.test.bed -g GRCh38_no_alt_analysis_set_GCA_000001405.15.genome -bed
# Number of query intervals: 926535
# Number of db intervals: 714888
# Number of overlaps: 622725
# Number of possible intervals (estimated): 9061211
# phyper(622725 - 1, 926535, 9061211 - 926535, 714888, lower.tail=F)
# Contingency Table Of Counts
#_________________________________________
#           |  in -b       | not in -b    |
#     in -a | 622725       | 303810       |
# not in -a | 92163        | 8042513      |
#_________________________________________
# p-values for fisher's exact test
left    right    two-tail    ratio
1    0    0    178.867
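
The phyper(...) line in the output is an R expression for an upper-tail hypergeometric probability, P(X >= number of overlaps). The Python sketch below computes the same quantity exactly with the standard library. The function name is hypothetical, and treating this value as the "right" column is an assumption. Exact big-integer sums are only practical for small counts; for counts as large as those above, a log-space or library implementation would be the realistic choice.

```python
from fractions import Fraction
from math import comb


def right_tail(n_overlaps, n_query, n_possible, n_db):
    """P(X >= n_overlaps) for X ~ Hypergeometric.

    Mirrors phyper(n_overlaps - 1, n_query, n_possible - n_query,
                   n_db, lower.tail=F) in R.
    """
    m = n_query               # "white balls": query intervals
    n = n_possible - n_query  # "black balls": everything else
    k = n_db                  # number of draws: db intervals
    total = comb(n_possible, k)
    upper = min(m, k)
    numerator = sum(
        comb(m, x) * comb(n, k - x) for x in range(n_overlaps, upper + 1)
    )
    return Fraction(numerator, total)


# Toy table: 3 overlaps, 4 query intervals, 4 db intervals, 8 possible.
print(right_tail(3, 4, 8, 4))  # prints 17/70
```

A small right-tail value suggests that the two interval sets overlap more often than expected by chance under this model.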

File formats this tool works with
BED, BAM, GFF, GTF, VCF
