bedtools groupby manual with usage examples

Usage

bedtools groupby [OPTIONS] -i <input> -g <group columns> -c <op. column> -o <operation>

Manual

This tool is part of the bedtools suite.

bedtools groupby mimics the group by clause in database systems. Given a file or stream that is sorted by the appropriate grouping columns (-g), groupby computes summary statistics on another column (-c) in the file or stream. It works with output from all bedtools tools as well as any other tab-delimited file or stream. As such, it is useful for command-line analyses in general, not just genomics-related research.

Related tools: bedtools merge

Required arguments

-i: The input file that should be grouped and summarized. Use stdin when using piped input. Note: if -i is omitted, input is assumed to come from standard input (stdin).
The input data must be ordered by the same columns as specified with the -grp argument, which establish which columns should be used to define a group of similar data.
For example, if you want to group by the first three columns (-grp 1,2,3), the data should be pre-grouped accordingly (with commands like sort -k1,1 -k2,2 -k3,3 data.txt). When bedtools groupby detects a change in the group columns, it summarizes all lines of the group that just ended.
-g <str>, -grp <str>: Specifies which column(s) (1-based) should be used to group the input. Columns may be given as a comma-separated list in which each column is explicitly listed, or as a range (e.g., 1-4). Default: 1,2,3.
-c <integer>, -opCols <integer>: Specify the column (1-based) that should be summarized. Required.
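The sorted-input requirement can be illustrated with a minimal Python sketch. It is not bedtools code: the summarize function and the toy rows are hypothetical, and the sketch only assumes what is described above, namely that a group is closed as soon as the group columns change. Python's itertools.groupby behaves the same way, so unsorted input yields one output line per run of identical keys rather than one line per key.

```python
from itertools import groupby


def summarize(rows, group_cols, op_col, op=sum):
    """Apply op to op_col for each run of consecutive rows sharing group_cols.

    Column numbers are 1-based, mirroring the -g and -c options.
    This is an illustrative stand-in, not the bedtools implementation.
    """
    def key(row):
        return tuple(row[c - 1] for c in group_cols)

    result = []
    for key_vals, run in groupby(rows, key=key):
        values = [int(row[op_col - 1]) for row in run]
        result.append(key_vals + (op(values),))
    return result


# Hypothetical two-column input in which geneA appears in two separate runs.
rows = [("geneA", "1"), ("geneB", "2"), ("geneA", "3")]

unsorted_result = summarize(rows, group_cols=[1], op_col=2)
sorted_result = summarize(sorted(rows), group_cols=[1], op_col=2)

print(unsorted_result)  # geneA is reported twice, once per run
print(sorted_result)    # geneA is reported once, with both values summed
```

Sorting on the grouping columns first, as in the sort command above, is what makes each key appear in exactly one run.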

Options

-o <str>, -op <str>: Specify the operation that should be applied to -opCols. Default: sum. Valid operations are:
- sum - numeric only
- count - numeric or text
- count_distinct - numeric or text
- min - numeric only
- max - numeric only
- mean - numeric only
- median - numeric only
- mode - the most frequent value, numeric or text
- antimode - the least frequent value, numeric or text
- stdev - numeric only
- sstdev - sample standard deviation, numeric only
- collapse - print a comma-separated list, numeric or text (duplicates allowed)
- distinct - print a comma-separated list (NO duplicates allowed), numeric or text
- distinct_sort_num - as distinct, but sorted numerically, ascending
- distinct_sort_num_desc - as distinct, but sorted numerically, descending
- concat - print a comma-separated list, numeric or text
- freqasc - print a comma-separated list of values observed and the number of times they were observed. Reported in ascending order of frequency.
- freqdesc - print a comma-separated list of values observed and the number of times they were observed. Reported in descending order of frequency.
- first - print first value
- last - print last value
If there is only one column but multiple operations, all operations will be applied to that column. Likewise, if there is only one operation but multiple columns, that operation will be applied to all columns. Otherwise, the number of columns must match the number of operations, and the operations will be applied in respective order. E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5, the mean of column 4, and the count of column 6. The order of output columns will match the ordering given in the command.
-full: Print all columns from input file. The first line in the group is used. Default: print only grouped columns.
-inheader: Input file has a header line - the first line will be ignored.
-outheader: Print header line in the output, detailing the column names. If the input file has headers (-inheader), the output file will use the input's column names. If the input file has no headers, the output file will use "col_1", "col_2", etc. as the column names.
-header: same as -inheader -outheader
-ignorecase: Group values regardless of upper/lower case.
-prec: Sets the decimal precision for output (Default: 5)
-delim: Specify a custom delimiter for the collapse operations. Default: ",". If you want to switch the delimiter to |, set -delim "|".
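The rule for combining several -c columns with several -o operations can be sketched as a small pairing function. This is an illustration of the rule as stated above, not the tool's source code; the name pair_columns_and_ops and the error message are hypothetical.

```python
def pair_columns_and_ops(cols, ops):
    """Return (column, operation) pairs in output order.

    Mirrors the documented rule: equal-length lists pair up position by
    position, a single column takes every operation, and a single
    operation is applied to every column.
    """
    if len(cols) == len(ops):
        return list(zip(cols, ops))
    if len(cols) == 1:
        return [(cols[0], op) for op in ops]
    if len(ops) == 1:
        return [(col, ops[0]) for col in cols]
    raise ValueError("number of columns must match number of operations")


# The example from the text: -c 5,4,6 -o sum,mean,count
print(pair_columns_and_ops([5, 4, 6], ["sum", "mean", "count"]))
# One column, several operations: -c 9 -o min,max
print(pair_columns_and_ops([9], ["min", "max"]))
# Several columns, one operation: -c 5,9 -o sum
print(pair_columns_and_ops([5, 9], ["sum"]))
```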

Examples

Default behavior

Let’s imagine we have three incredibly interesting genetic variants that we are studying and we are interested in what annotated repeats these variants overlap.

$ cat variants.bed
chr21  9719758 9729320 variant1
chr21  9729310 9757478 variant2
chr21  9795588 9796685 variant3

$ bedtools intersect -a variants.bed -b repeats.bed -wa -wb > variantsToRepeats.bed
$ cat variantsToRepeats.bed
chr21  9719758 9729320 variant1   chr21  9719768 9721892 ALR/Alpha   1004  +
chr21  9719758 9729320 variant1   chr21  9721905 9725582 ALR/Alpha   1010  +
chr21  9719758 9729320 variant1   chr21  9725582 9725977 L1PA3       3288  +
chr21  9719758 9729320 variant1   chr21  9726021 9729309 ALR/Alpha   1051  +
chr21  9729310 9757478 variant2   chr21  9729320 9729809 L1PA3       3897  -
chr21  9729310 9757478 variant2   chr21  9729809 9730866 L1P1        8367  +
chr21  9729310 9757478 variant2   chr21  9730866 9734026 ALR/Alpha   1036  -
chr21  9729310 9757478 variant2   chr21  9734037 9757471 ALR/Alpha   1182  -
chr21  9795588 9796685 variant3   chr21  9795589 9795713 (GAATG)n    308   +
chr21  9795588 9796685 variant3   chr21  9795736 9795894 (GAATG)n    683   +
chr21  9795588 9796685 variant3   chr21  9795911 9796007 (GAATG)n    345   +
chr21  9795588 9796685 variant3   chr21  9796028 9796187 (GAATG)n    756   +
chr21  9795588 9796685 variant3   chr21  9796202 9796615 (GAATG)n    891   +
chr21  9795588 9796685 variant3   chr21  9796637 9796824 (GAATG)n    621   +

We can see that variant1 overlaps with 4 repeats, variant2 with 4 and variant3 with 6. We can use bedtools groupby to summarize the hits for each variant in several useful ways. The default behavior is to compute the sum of the -opCols.

$ bedtools groupby -i variantsToRepeats.bed -g 1,2,3 -c 9
chr21 9719758 9729320 6353
chr21 9729310 9757478 14482
chr21 9795588 9796685 3604

Computing the min and max

Now let’s find the min and max repeat score for each variant. We do this by grouping on the variant coordinate columns (i.e., cols. 1, 2 and 3) and asking for the min or max of the repeat score column (i.e., col. 9). The command below requests the min; replacing -o min with -o max gives the max.

$ bedtools groupby -i variantsToRepeats.bed -g 1,2,3 -c 9 -o min
chr21 9719758 9729320 1004
chr21 9729310 9757478 1036
chr21 9795588 9796685 308

We can also group on just the name column with similar effect.

$ bedtools groupby -i variantsToRepeats.bed -g 4 -c 9 -o min
variant1 1004
variant2 1036
variant3 308
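The per-variant figures in these examples can be cross-checked with a short Python sketch. The scores are copied from columns 4 and 9 of variantsToRepeats.bed above; the variable names are illustrative, and the code only mirrors the grouping logic rather than calling bedtools. By the column and operation rule in the Options section, a single call with -c 9 -o min,max should report the last two values together.

```python
from itertools import groupby

# (variant name, repeat score) pairs: columns 4 and 9 of variantsToRepeats.bed.
hits = [
    ("variant1", 1004), ("variant1", 1010), ("variant1", 3288), ("variant1", 1051),
    ("variant2", 3897), ("variant2", 8367), ("variant2", 1036), ("variant2", 1182),
    ("variant3", 308), ("variant3", 683), ("variant3", 345),
    ("variant3", 756), ("variant3", 891), ("variant3", 621),
]

# The rows are already ordered by variant name, so each name forms one run.
summary = {}
for name, run in groupby(hits, key=lambda hit: hit[0]):
    scores = [score for _, score in run]
    summary[name] = (len(scores), sum(scores), min(scores), max(scores))

for name, (count, total, low, high) in summary.items():
    print(name, count, total, low, high)
# variant1 4 6353 1004 3288
# variant2 4 14482 1036 8367
# variant3 6 3604 308 891
```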

File formats this tool works with

BED
