Category

Sequence Analysis


Usage

faSplit how input.fa count outRoot


Manual

This tool is part of UCSC Genome Browser's utilities.

Required arguments

  • how: One of the following:
    • about: input.fa will be split into files of about count bytes each, by record.
    • byname: input.fa will be broken up by sequence name (e.g. chr1, chr2, etc.).
    • base: input.fa will be broken at any base.
    • gap: input.fa will be split into files of at most count bases each, at gap boundaries if possible.
    • sequence: input.fa will be split by sequence; files will be broken at the nearest fa record boundary.
    • size: input.fa will be broken every count bases.
  • input.fa: The input fasta file to be split
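  • count: A number whose meaning depends on how: bytes per file for about, bases per piece for size and gap, and the number of output files for base and sequence (see Examples). The byname example omits it.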
  • outRoot: Prefix for output files

Options

  • -verbose=2: Write names of each file created (=3 more details)
  • -maxN=N: Suppress pieces with more than maxN N's. Only used with size. Default is size-1 (only suppresses pieces that are all N).
  • -oneFile: Put output in one file. Only used with size.
  • -extra=N: Add N extra bytes at the end to form overlapping pieces. Only used with size.
  • -out=outFile: Get masking from outfile. Only used with size.
  • -lift=file.lft: Put info on how to reconstruct sequence from pieces in file.lft. Only used with size and gap.
  • -minGapSize=X: Consider a block of N's to be a gap if the block size is >= X. Default value 1000. Only used with gap.
  • -noGapDrops: Include all N's when splitting by gap.
  • -outDirDepth=N: Create N levels of output directory under the current dir. This helps prevent NFS problems with a large number of files in a directory. Using -outDirDepth=3 would produce ./1/2/3/outRoot123.fa.
  • -prefixLength=N: Used with byname. Create a separate output file for each group of sequence names that share the same prefix of length N.

Examples

The sequence mode
$ faSplit sequence dm6.fa 100 dm6_seq

This will break up dm6.fa into 100 files (the number is set by count), numbered dm6_seq000.fa, dm6_seq001.fa, ..., dm6_seq099.fa. Files will only be broken at fasta record boundaries. For example, chromosomes in reference genomes are stored in separate fasta records (>chr2L, >chr2R, etc.), and the sequence mode makes sure that the sequence of one chromosome (record) is not split across multiple files.

$ ls dm6_seq*
dm6_seq000.fa  dm6_seq015.fa  dm6_seq030.fa  dm6_seq045.fa  dm6_seq060.fa  dm6_seq075.fa  dm6_seq090.fa
dm6_seq001.fa  dm6_seq016.fa  dm6_seq031.fa  dm6_seq046.fa  dm6_seq061.fa  dm6_seq076.fa  dm6_seq091.fa
dm6_seq002.fa  dm6_seq017.fa  dm6_seq032.fa  dm6_seq047.fa  dm6_seq062.fa  dm6_seq077.fa  dm6_seq092.fa
dm6_seq003.fa  dm6_seq018.fa  dm6_seq033.fa  dm6_seq048.fa  dm6_seq063.fa  dm6_seq078.fa  dm6_seq093.fa
dm6_seq004.fa  dm6_seq019.fa  dm6_seq034.fa  dm6_seq049.fa  dm6_seq064.fa  dm6_seq079.fa  dm6_seq094.fa
dm6_seq005.fa  dm6_seq020.fa  dm6_seq035.fa  dm6_seq050.fa  dm6_seq065.fa  dm6_seq080.fa  dm6_seq095.fa
dm6_seq006.fa  dm6_seq021.fa  dm6_seq036.fa  dm6_seq051.fa  dm6_seq066.fa  dm6_seq081.fa  dm6_seq096.fa
dm6_seq007.fa  dm6_seq022.fa  dm6_seq037.fa  dm6_seq052.fa  dm6_seq067.fa  dm6_seq082.fa  dm6_seq097.fa
dm6_seq008.fa  dm6_seq023.fa  dm6_seq038.fa  dm6_seq053.fa  dm6_seq068.fa  dm6_seq083.fa  dm6_seq098.fa
dm6_seq009.fa  dm6_seq024.fa  dm6_seq039.fa  dm6_seq054.fa  dm6_seq069.fa  dm6_seq084.fa  dm6_seq099.fa
dm6_seq010.fa  dm6_seq025.fa  dm6_seq040.fa  dm6_seq055.fa  dm6_seq070.fa  dm6_seq085.fa
dm6_seq011.fa  dm6_seq026.fa  dm6_seq041.fa  dm6_seq056.fa  dm6_seq071.fa  dm6_seq086.fa
dm6_seq012.fa  dm6_seq027.fa  dm6_seq042.fa  dm6_seq057.fa  dm6_seq072.fa  dm6_seq087.fa
dm6_seq013.fa  dm6_seq028.fa  dm6_seq043.fa  dm6_seq058.fa  dm6_seq073.fa  dm6_seq088.fa
dm6_seq014.fa  dm6_seq029.fa  dm6_seq044.fa  dm6_seq059.fa  dm6_seq074.fa  dm6_seq089.fa
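
One way to sanity-check a sequence-mode split is to compare the number of fasta records before and after. The sketch below is not part of faSplit: count_records is a hypothetical helper, and the check assumes the dm6.fa input and the dm6_seq*.fa outputs from the example above are in the current directory.

```shell
# Hypothetical helper (not part of faSplit): count fasta records,
# i.e. header lines starting with ">", across the given files.
count_records() {
    # grep -c exits non-zero when the count is 0, so "|| true" keeps
    # the helper usable under "set -e".
    cat "$@" | grep -c '^>' || true
}

# Compare input and output only when the example files are present.
if [ -f dm6.fa ] && [ -f dm6_seq000.fa ]; then
    in_n=$(count_records dm6.fa)
    out_n=$(count_records dm6_seq*.fa)
    if [ "$in_n" -eq "$out_n" ]; then
        echo "all $in_n records are present in the split files"
    else
        echo "record count mismatch: input $in_n, output $out_n" >&2
    fi
fi
```

Equal counts only show that no record was lost or duplicated; they do not verify the sequence content itself.
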

The base mode
$ faSplit base dm6_2L.fa 3 dm6_2L_base

The base mode works when the fasta file contains only one record. In the example above, faSplit breaks the input file into 3 files (the number is set by count):

$ ls dm6_2L_base*
dm6_2L_base0.fa  dm6_2L_base1.fa  dm6_2L_base2.fa

The size mode
$ faSplit size dm6.fa 200000 dm6_200k

This breaks up dm6.fa into chunks of 200,000 bases. Sequences will be renamed as dm6_200kNNN, where NNN is the piece number.

$ head dm6_200k* | head
==> dm6_200k000.fa <==
>dm6_200k000
Cgacaatgcacgacagaggaagcagaacagatatttagattgcctctcat
tttctctcccatattatagggagaaatatgatcgcgtatgcgagagtagt
gccaacatattgtgctctttgattttttggcaacccaaaatggtggcgga
tgaaCGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCA
TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTG
CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG
CAAACCCAAAAAgacaatacacgacagagagagagagcagcggagatatt
tagattgcctattaaatatgatcgcgtatgcgagagtagtgccaacatat
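
To confirm that the pieces have the expected length, the bases in each output file can be counted. This is a minimal sketch, assuming the dm6_200k*.fa files from the example above exist in the current directory; count_bases is a hypothetical helper, not a faSplit feature.

```shell
# Hypothetical helper (not part of faSplit): count sequence characters
# in a fasta file, ignoring header lines and line breaks.
count_bases() {
    grep -v '^>' "$1" | tr -d '\n\r' | wc -c | tr -d ' '
}

# Report the length of the first few pieces, if the example output exists.
for piece in dm6_200k00[0-4].fa; do
    if [ -f "$piece" ]; then
        echo "$piece: $(count_bases "$piece") bases"
    fi
done
```

With the command above, one would expect no piece to report more than 200,000 bases; pieces at the end of a sequence may be shorter.
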

The about mode
$ faSplit about dm6.fa 200000 dm6_200k_about

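This will break up dm6.fa into files of about 200,000 bytes each, by record (i.e. the sequence of one chromosome/contig will not be split across multiple files). A record larger than count still goes into a single file, so files that hold a whole chromosome are much larger than 200,000 bytes, as the first entries in the listing below show.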

$ ls -s dm6_200k_about* | head
23424 dm6_200k_about00.fa
25192 dm6_200k_about01.fa
28004 dm6_200k_about02.fa
31956 dm6_200k_about03.fa
 1344 dm6_200k_about04.fa
  204 dm6_200k_about05.fa
  204 dm6_200k_about06.fa
  204 dm6_200k_about07.fa
  208 dm6_200k_about08.fa
  204 dm6_200k_about09.fa

The byname mode (get fasta files for each chromosome/contig)
$ faSplit byname dm6.fa dm6_chrs/

This breaks up dm6.fa using sequence names as file names. Notes:

  • The output folder must exist; otherwise, the program raises errors like: mustOpen: Can't open dm6_chrs/chr2L.fa to write: No such file or directory
  • The output folder name must end with `/`

$ tree dm6_chrs | head
dm6_chrs
├── chr2L.fa
├── chr2R.fa
├── chr3L.fa
├── chr3R.fa
├── chr4.fa
├── chrM.fa
├── chrUn_CP007071v1.fa
├── chrUn_CP007072v1.fa
├── chrUn_CP007073v1.fa
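
The two notes above can be folded into a small wrapper that creates the output folder before splitting. This is a sketch under the assumptions of the example (input dm6.fa, output folder dm6_chrs/); the out_dir variable name is illustrative only.

```shell
# Output folder from the example above; the trailing "/" is required.
out_dir="dm6_chrs/"

# Create the folder first, since faSplit reports an error if it is missing.
mkdir -p "$out_dir"

# Run the split only when the tool and the example input are available.
if command -v faSplit >/dev/null 2>&1 && [ -f dm6.fa ]; then
    faSplit byname dm6.fa "$out_dir"
else
    echo "faSplit or dm6.fa not found; nothing was split" >&2
fi
```
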

The gap mode
$ faSplit gap chr2L.fa 2000000 chr2_gap

This breaks up chr2L.fa into files of at most 2,000,000 bases each, at gap boundaries if possible. If the sequence ends in N's, the last piece, if larger than 2,000,000, will be all one piece.

$ ls -s chr2_gap*
1996 chr2_gap00.fa  1996 chr2_gap02.fa  1996 chr2_gap04.fa  1996 chr2_gap06.fa  1996 chr2_gap08.fa  1996 chr2_gap10.fa
1996 chr2_gap01.fa  1996 chr2_gap03.fa  1996 chr2_gap05.fa  1996 chr2_gap07.fa  1996 chr2_gap09.fa  1508 chr2_gap11.fa
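
Whether a run of N's counts as a gap depends on -minGapSize (default 1000, see Options). Before choosing a value, it can help to know how long the N runs in the input actually are. The sketch below is not part of faSplit: longest_n_run is a hypothetical helper that treats the file as one sequence (so it suits single-record input such as chr2L.fa), and the -minGapSize=100 value and the chr2_gap.lft file name are purely illustrative.

```shell
# Hypothetical helper (not part of faSplit): print the length of the
# longest run of N/n in a fasta file. Lines are joined first, so runs
# that wrap across lines are counted whole. Records are also joined,
# so this is meant for single-record files.
longest_n_run() {
    grep -v '^>' "$1" \
        | tr -d '\n\r' \
        | tr -cs 'Nn' '\n' \
        | awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }'
}

if [ -f chr2L.fa ]; then
    echo "longest N run in chr2L.fa: $(longest_n_run chr2L.fa)"
    # Illustrative only: lower the gap threshold and record a lift file.
    if command -v faSplit >/dev/null 2>&1; then
        faSplit gap chr2L.fa 2000000 chr2_gap -minGapSize=100 -lift=chr2_gap.lft
    fi
fi
```

If the longest run is shorter than the chosen -minGapSize, no block of N's qualifies as a gap under the rule described in Options.
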

