Category

Multiple sequence alignments / Conservation


Usage

mafSpeciesSubset in.maf species.lst out.maf


Manual

This tool is part of UCSC Genome Browser's utilities.

Required arguments

  • in.maf: a file where the sequence source are either simple species names, or species.something. Usually actually it's a genome database name rather than a species before the dot to tell the truth. Set this value as stdin if you want to use this tool to read input from stdin.
  • species.lst: a file with a list of species to keep
  • out.maf: the output. It will have columns that are all - or . in the reduced species set removed, as well as the lines representing species not in species.lst removed. Set the value as stdout if you want use pipes.

Options

  • -keepFirst: If set, keep the first 'a' line in a maf no matter what. Useful for mafFrag results where we use this for the gene name

Example

In the following example, we extract the MAF of chromosome 22 for human(hg38) and Chimpanzee (panTro5) (defined in a file called species.lst) from the multiz30way multiple alignment file (stored in the file maf/chr22.maf.gz). The extracted multiple alignments are written to a file called results.maf.

$ cat species.lst
hg38
panTro5

$ mafSpeciesSubset maf/chr22.maf.gz species.lst results.maf

$ head results.maf
##maf version=1 scoring=roast.v3.3
a score=59305.000000
s hg38.chr22    10510000 7 + 50818468 GAATTCT
s panTro5.chr20 39659357 7 - 66533130 GAATTCT
i panTro5.chr20 N 0 C 0

a score=1614586.000000
s hg38.chr22    10510007 206 + 50818468 TGTGTTTATATAATAAGATGTCCTATAATTTCTGTTTGGAATATAAAATCAGCAACTAATATGTATTTTCAAAGCATTATAAATACAGAGTGCTAAGTTACTTCACTGTGAAATGTAGTCATATAAAGAACATAATAATTATACTGGATTATTTTTAAATGGGCTGTCTAACATTATATTAAAAGGTTTCATCAGTAATTCATTAT
s panTro5.chr20 39659364 206 - 66533130 TGTGTTTATATAATATGGTGTCCTATAATTTCTGTTTGGAATATAAAATCAGCAACTAGTGTGTATTTTCAAAGCATTATAAATACAGAGTGCTAAGTGACCTCACTGTGAAATGTAGTCATATAAAGAATATAATAATTATACTGGATTCTTTTTAAATGGGCTGTCTAACATTATATTAAAAGGTATCTCCAGTAATTCATTAT
i panTro5.chr20 C 0 C 0

Extracting the MAF for an entire chromosome can take some time, if you are only interested in a set of genomic ranges (say you define them in a bed file called range.bed), then you can extract MAF fragments with mafFrags and pipe the output to mafSpeciesSubset:

$ mafFrags hg38 multiz30way range.bed stdout | \
mafSpeciesSubset stdin species.lst results.maf


Share your experience or ask a question