Category

Sequence Analysis


Usage

samtools faidx [options] <ref.fasta> [region1 [...]]


Manual

This tool is part of the samtools suite.

Indexes a reference sequence in FASTA format, or extracts subsequences from an indexed reference sequence. If no region is specified, faidx indexes the file and creates <ref.fasta>.fai on disk. If regions are specified, the subsequences are retrieved and printed to stdout in FASTA format (similar to bedtools getfasta).

The input file can be compressed in the BGZF format.

The sequences in the input file should all have different names. If they do not, indexing will emit a warning about duplicate sequences and retrieval will only produce subsequences from the first sequence with the duplicated name.
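Because duplicated names only produce a warning, it can help to check for them before indexing. The following is a minimal sketch using standard Unix tools rather than samtools itself. It assumes that the sequence name is the text between '>' and the first whitespace on the header line; toy_dups.fa is a made-up file used only for illustration.

```shell
# Toy FASTA with a duplicated name; "toy_dups.fa" is a hypothetical file name.
printf '>chr1 first copy\nACGT\n>chr2\nGGCC\n>chr1 second copy\nTTAA\n' > toy_dups.fa

# Keep only the name (text after '>' up to the first whitespace), then list
# every name that occurs more than once.
dups=$(grep '^>' toy_dups.fa | sed 's/^>//' | awk '{print $1}' | sort | uniq -d)

if [ -n "$dups" ]; then
    echo "Duplicate sequence names:"
    echo "$dups"
fi
```

If nothing is printed, every name in the file is unique under that assumption.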

FASTQ files can also be read and indexed by this command. Unless --fastq is given, extracted subsequences are written in FASTA format.

Required arguments

  • ref.fasta: Path to the input fasta file (for indexing or querying)

Options

  • [region 1, region 2, ...]: Regions of interest, in the format chr[:from-to]. The :from-to part is optional; when it is omitted, faidx extracts the entire sequence. Multiple regions are supported. Coordinates are 1-based. See also the -r option.
  • -o, --output FILE: Write FASTA to file.
  • -n, --length INT: Length of FASTA sequence line. [60]
  • -c, --continue: Continue after trying to retrieve missing region.
  • -r, --region-file FILE: File of regions. Format is chr:from-to. One per line. Coordinates are 1-based.
  • -i, --reverse-complement: Reverse complement sequences.
  • --mark-strand TYPE: Add strand indicator to sequence name. Allowed values for TYPE:
    • rc: Append '/rc' when writing the reverse complement. (default)
    • no: Do not append anything
    • sign: Append '(+)' for forward strand or '(-)' for reverse complement. This matches the output of bedtools getfasta -s.
    • custom,<pos>,<neg>: Append string <pos> to names when writing the forward strand and <neg> when writing the reverse strand. Spaces are preserved, so it is possible to move the indicator into the comment part of the description line by including a leading space in the strings <pos> and <neg>.
  • --fai-idx FILE: Name of the index file (default file.fa.fai).
  • --gzi-idx FILE: Name of the compressed file index (default file.fa.gz.gzi).
  • -f, --fastq: File and index in FASTQ format.
  • -h, --help: This message.
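
To make the -i and --mark-strand options above more concrete, here is a rough sketch of what reverse-complementing a sequence and marking its name amounts to. It uses standard Unix tools, not samtools, and only handles A, C, G, T and N; the region name and sequence are made up for illustration.

```shell
# Hypothetical input: a short sequence as it might be extracted for chr1:1-5.
name="chr1:1-5"
seq="AACGT"

# Reverse the string, then swap each base for its complement.
# Only A, C, G, T and N (both cases) are handled, not IUPAC ambiguity codes.
rc=$(printf '%s\n' "$seq" | rev | tr 'ACGTNacgtn' 'TGCANtgcan')

# Mimic the default "--mark-strand rc" naming described above: append '/rc'.
printf '>%s/rc\n%s\n' "$name" "$rc"
```

With samtools itself, the same effect would come from passing -i on the command line; this sketch is only meant to show what the option does to the sequence and its name.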

Examples

Index a FASTA file

Some tools require a FASTA file to be accompanied by an index in the fai format. The faidx command creates one:

$ samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta

Chromosome sizes are often stored in a simple tab-delimited file with two columns: the chromosome name (e.g., "chr1", "chr2") and the chromosome size in base pairs. Because this format is easy to parse, it is widely used in downstream analysis. If you have the FASTA file at hand, you can produce this file by cutting the first two columns out of the index file generated by faidx:

$ cut -f 1,2 GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai

chr1    248956422
chr2    242193529
chr3    198295559
chr4    190214555
chr5    181538259
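
The cut command above can also be redirected into a file and summarised. The sketch below works on a small, made-up index (toy.fa.fai) so that it runs without a real genome. It relies on the .fai layout of five tab-separated columns: name, length, byte offset, bases per line, and bytes per line.

```shell
# Toy index with the five tab-separated .fai columns:
# name, length, offset, bases per line, bytes per line.
# "toy.fa.fai" and its values are made up for illustration.
printf 'chrA\t100\t6\t60\t61\nchrB\t50\t114\t60\t61\n' > toy.fa.fai

# Save name and length as a two-column chromosome sizes file.
cut -f 1,2 toy.fa.fai > toy.chrom.sizes

# Sum the second column to get the total number of bases.
total=$(awk -F '\t' '{sum += $2} END {print sum}' toy.chrom.sizes)
echo "Total length: $total bp"
```

For a real reference, replace toy.fa.fai with the .fai file that faidx wrote next to your FASTA file.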

Extract the sequence for genomic regions

Extract sequences for chr1:777836-777950 and chr2:1234567-1234789 (coordinates are 1-based):

$ samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta chr1:777836-777950 chr2:1234567-1234789
>chr1:777836-777950
AGTGTTGGGATTACAGGTGTCAGCCACTGAGCCTGGCGGAGCACTTTATGTTATTAAGTA
GCCTAACCCAGGTGGGTCGCTGTCCCTCACGCCTGTAATCCCGACAACTCTGATG
>chr2:1234567-1234789
GAGGTGAGATGTCCAGCCTGCCTCATGAAGCTATGGCATAAACGTGCCTGGACTGCAGAC
GCCTTCCTTTTTATTGCAGGACACAGCCGTCTGCCCCTCGTGCGAGTCCGTGAGCCTCTG
GGGCTCCACGTGCATTCACTGCCTCAGGGGGCAAAGCTGATGATCTTTCTCAAGACCACA
GCATCGATAAAGGGTCCTTCATGGAGCCTGGGTCCACTGTCTC

If you have a list of regions, you can save these regions into a text file, then use the -r option to get the sequences for these regions:

$ samtools faidx -r test.txt GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
>chr1:5842089-5842189
TGAGGGTCCATCTACCCCAGCAACTTCCGGCTTGGCCCGAGGAGTGTGAGTGGGTGTGGT
GTGAGCTGAAGGGTTCCTGTGCCTGGCAGTTTGACTTGCCT
>chr2:666777-666888
AGCAAGAGGGGCTTCAGTAAAGAGGGGAAGGCTGCACTGAAAACATGTTCTGGTTAACTT
GGTGAAAGAAGAAGGGTGCTCCACTGAAATCCAGACAAGGAAGCCTGAAGTT
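
If the regions start out in BED format, which uses 0-based, end-exclusive coordinates, they need to be shifted before they can be used with -r, because faidx regions are 1-based. The following sketch shows one way to do that with awk. The file names and intervals are hypothetical, and it assumes a plain three-column, tab-separated BED file without header lines.

```shell
# Hypothetical BED file (0-based start, end-exclusive); values are made up.
printf 'chr1\t5842088\t5842189\nchr2\t666776\t666888\n' > toy.bed

# Add 1 to the start so each interval becomes a 1-based, inclusive region
# in the chr:from-to form expected by -r.
awk -F '\t' '{print $1 ":" ($2 + 1) "-" $3}' toy.bed > toy_regions.txt
cat toy_regions.txt
```

The resulting toy_regions.txt has one chr:from-to region per line and can be passed to -r in place of test.txt.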

Note: Remember to remove empty lines in the query region file (including the final empty line); otherwise, you may see error messages like:

[W::fai_get_val] Reference  not found in FASTA file, returning empty sequence
>
[faidx] Failed to fetch sequence in
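
One way to avoid this is to filter the region file before passing it to -r. The sketch below drops lines that are empty or contain only whitespace; the file names are hypothetical.

```shell
# Hypothetical region file with a blank line in the middle and one at the end.
printf 'chr1:5842089-5842189\n\nchr2:666777-666888\n\n' > test_with_blanks.txt

# Drop lines that are empty or contain only whitespace.
grep -v '^[[:space:]]*$' test_with_blanks.txt > test_clean.txt

# The cleaned file can then be passed to -r, for example:
#   samtools faidx -r test_clean.txt <ref.fasta>
cat test_clean.txt
```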

File formats this tool works with
FASTA
