Category

Sam/Bam Manipulation


Usage

samtools merge [options] out.bam in1.bam ... inN.bam

samtools merge [options] -o out.bam [options] in1.bam ... inN.bam


Manual

Merge multiple sorted alignment files, producing a single sorted output file that contains all the input records and maintains the existing sort order.

Required arguments

  • out.bam: The first BAM filename argument is the destination of the merged output. If you prefer, you can use the -o option to specify the destination. When -o is used, all non-option filename arguments specify input files to be merged. To write to standard output (or to a pipe), use either -o - or, equivalently, - as the first filename argument.
  • in1.bam ... inN.bam: BAM files to be merged. You can also use the -b option to provide a plain text file that lists the BAM files you want to merge.
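The ways of naming the output and inputs described above can be sketched as follows. The helper make_bam_list and every file name here are placeholders chosen for illustration; they are not part of samtools.

```shell
# Hypothetical helper (not part of samtools): write one input path per
# line to a list file, the plain-text format that the -b option reads.
make_bam_list() {
    list_file=$1
    shift
    printf '%s\n' "$@" > "$list_file"
}

# Example invocations; all file names are placeholders.
#   make_bam_list inputs.txt in1.bam in2.bam in3.bam
#   samtools merge -o out.bam -b inputs.txt       # inputs taken from the list
#   samtools merge out.bam in1.bam in2.bam        # output as first argument
#   samtools merge -o out.bam in1.bam in2.bam     # same result, using -o
#   samtools merge -o - in1.bam in2.bam | samtools view -c -   # to a pipe
```

Because the list file holds one path per line, this sketch assumes that no input path contains a newline.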

Options

  • -1: Use Deflate compression level 1 to compress the output.
  • -b file: List of input BAM files, one file per line.
  • -f: Force to overwrite the output file if present.
  • -h file: When the -h option is not used, samtools merges all headers from the input BAMs. When a SAM/BAM FILE is provided to the -h option, the lines of FILE are used as `@' headers to be copied to out.bam, replacing any header lines that would otherwise be copied from in1.bam. If, while merging @SQ lines for coordinate-sorted input files, a conflict arises over their order (for example, input1.bam has @SQ (reference sequence) lines for a,b,c and input2.bam has b,a,c), then the resulting output file will need to be re-sorted back into coordinate order.
  • -n: The input alignments are sorted by read names using an alpha-numeric ordering, rather than by chromosomal coordinates.
  • -N: The input alignments are sorted by read names using a lexicographical ordering, rather than by chromosomal coordinates.
  • -o file: Write merged output to FILE, specifying the filename via an option rather than as the first filename argument. When -o is used, all non-option filename arguments specify input files to be merged.
  • -t tag: The input alignments have been sorted by the value of TAG, then by either position or name (if -n is given).
  • -R str: Merge files in the specified region indicated by STR [null].
  • -r: Attach an RG tag to each alignment. The tag value is inferred from file names.
  • -u: Uncompressed BAM output.
  • -c: When several input files contain @RG headers with the same ID, emit only one of them (namely, the header line from the first file we find that ID in) to the merged output file. Combining these similar headers is usually the right thing to do when the files being merged originated from the same file.
  • -p: Similarly, for each @PG ID in the set of files to merge, use the @PG line of the first file we find that ID in rather than adding a suffix to differentiate similar IDs.
  • -X: If this option is set, it will allow the user to specify customized index file location(s) if the data folder does not contain any index file.
  • -L file: BED file for specifying multiple regions on which the merge will be performed. This option extends the usage of -R option and cannot be used concurrently with it.
  • --no-PG: Do not add a @PG line to the header of the output file.
  • -@, --threads int: Number of input/output compression threads to use in addition to the main thread [0].
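As one illustration of combining these options, the sketch below restricts a merge to a single region with -R and uses two extra compression threads. It assumes that region merging needs an index for each input, so it runs samtools index on any input that lacks a .bai file next to it. The function name merge_region, the thread count, and the example region are arbitrary choices, not samtools defaults.

```shell
# merge_region REGION OUT.bam IN1.bam [IN2.bam ...]
# Hypothetical wrapper (not part of samtools) around "samtools merge -R".
merge_region() {
    region=$1
    out=$2
    shift 2
    for bam in "$@"; do
        # Assumption: -R needs indexed inputs; build a .bai if none is present.
        [ -e "$bam.bai" ] || samtools index "$bam" || return 1
    done
    # -f overwrites OUT if it exists; -@ 2 adds two compression threads.
    samtools merge -f -@ 2 -R "$region" -o "$out" "$@"
}

# Example (placeholder names):
#   merge_region chr1:1-1000000 chr1_merged.bam sample1.bam sample2.bam
```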

Notes

Unless the -c or -p flags are specified, when @RG and @PG records are merged into the output header, any ID that duplicates an existing ID in the output header has a suffix appended to differentiate it from similar header records from other files. The read records are updated to reflect this.

The ordering of the records in the input files must match the usage of the -n, -N and -t command-line options. If it does not, the output order will be undefined. Note that this also rules out mixing "queryname" files that use a combination of natural and lexicographical sort orders.

Problems may arise when attempting to merge thousands of files together. The operating system may impose a limit on the maximum number of simultaneously open files. Additionally many files being read from simultaneously may cause a certain amount of "disk thrashing". To partially alleviate this the merge command will load 1MB of data at a time from each file, but this in turn adds to the overall merge program memory usage. Please take this into account when setting memory limits. In extreme cases, it may be necessary to reduce the problem to fewer files by successively merging subsets before a second round of merging.
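The two-round approach mentioned above could look roughly like the sketch below. The function name merge_in_batches, the temporary file layout, and the batch size are illustrative assumptions, and the sketch assumes coordinate-sorted inputs; for name-sorted or tag-sorted inputs the matching -n, -N or -t option would have to be added to both merge calls.

```shell
# merge_in_batches OUT.bam BATCH_SIZE IN1.bam IN2.bam ...
# Hypothetical helper (not part of samtools): merge the inputs in groups
# of BATCH_SIZE, then merge the intermediate files into OUT.bam.
merge_in_batches() {
    out=$1
    batch_size=$2
    shift 2
    [ "$batch_size" -ge 1 ] || return 1
    tmpdir=$(mktemp -d) || return 1
    list="$tmpdir/intermediates.txt"
    : > "$list"
    batch=0
    while [ "$#" -gt 0 ]; do
        batch=$((batch + 1))
        batch_list="$tmpdir/batch_$batch.txt"
        : > "$batch_list"
        i=0
        # Collect up to BATCH_SIZE input paths, one per line, for -b.
        while [ "$#" -gt 0 ] && [ "$i" -lt "$batch_size" ]; do
            printf '%s\n' "$1" >> "$batch_list"
            shift
            i=$((i + 1))
        done
        # First round: merge this subset into an intermediate file.
        samtools merge -f -o "$tmpdir/batch_$batch.bam" -b "$batch_list" || return 1
        printf '%s\n' "$tmpdir/batch_$batch.bam" >> "$list"
    done
    # Second round: merge the intermediates into the final output.
    samtools merge -f -o "$out" -b "$list" || return 1
    # Clean up; on an earlier failure the temporary directory is left behind.
    rm -rf "$tmpdir"
}

# Example (placeholder names): merge_in_batches all.bam 500 sample_*.bam
```

Each call opens at most BATCH_SIZE inputs at once, which keeps the process under the open-file limit reported by ulimit -n. Merging in rounds may assign suffixes to duplicate @RG and @PG IDs differently than a single merge would, so the header of the result is worth checking.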

Examples

The following example attaches the RG tag while merging sorted alignments:

$ printf '@RG\tID:ga\tSM:hs\tLB:ga\tPL:ILLUMINA\n@RG\tID:454\tSM:hs\tLB:454\tPL:LS454\n' > rg.txt
$ samtools merge -r -h rg.txt merged.bam ga.bam 454.bam

The value of the RG tag is determined by the name of the file the read comes from. In this example, reads in merged.bam that come from ga.bam are tagged RG:Z:ga, while reads that come from 454.bam are tagged RG:Z:454.
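One way to check the result is to count alignments per RG value in the merged file. The sketch below reads SAM text on standard input and tallies the RG:Z optional field, which in SAM appears from column 12 onward. The function name count_rg_tags is a placeholder, not a samtools command.

```shell
# Hypothetical helper (not part of samtools): count alignments per RG:Z
# value in tab-separated SAM records read from standard input.
count_rg_tags() {
    awk -F'\t' '
        {
            # Optional fields start at column 12 in SAM.
            for (i = 12; i <= NF; i++)
                if ($i ~ /^RG:Z:/) n[substr($i, 6)]++
        }
        END { for (rg in n) print rg "\t" n[rg] }
    ' | sort
}

# Example (placeholder file name):
#   samtools view merged.bam | count_rg_tags
```

Each output line holds an RG value and the number of alignments carrying it, separated by a tab. Alignments without an RG tag are not counted.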

Protocols using this tool

PROcap preprocessing (with two replicates)

File formats this tool works with

BAM
