Index or query the sequences of regions from a fasta file
samtools faidx [options] <ref.fasta> [region1 [...]]
This tool is part of the samtools suite.
Index a reference sequence in FASTA format, or extract subsequences from an indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on disk. If regions are specified, the subsequences are retrieved and printed to stdout in FASTA format (similar to bedtools getfasta).
The input file can be compressed in the BGZF format.
The sequences in the input file should all have different names. If they do not, indexing will emit a warning about duplicate sequences and retrieval will only produce subsequences from the first sequence with the duplicated name.
FASTQ files can be read and indexed by this command. Without using --fastq any extracted subsequence will be in FASTA format.
If a region names only a sequence and gives no coordinates (for example, chr1), faidx extracts the sequence of the entire chromosome. Multiple regions are supported. Coordinates are 1-based. See also the -r option. For strand-aware extraction, compare bedtools getfasta -s.
Some tools require the sequences in a FASTA file to have an accompanying index in the fai format. The faidx command can create it for you:
$ samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
Chromosome sizes are often needed in a simple, tab-delimited format with two columns: one for the chromosome name (e.g., "chr1", "chr2") and the other for the chromosome size in base pairs. Because this format is easy to parse, it is widely used in downstream analysis. If you have the FASTA file at hand, you can get this file by cutting the first two columns out of the index file generated by faidx:
$ cut -f 1,2 GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai
chr1	248956422
chr2	242193529
chr3	198295559
chr4	190214555
chr5	181538259
Extract sequences for chr1:777836-777950 and chr2:1234567-1234789 (coordinates are 1-based):
$ samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta chr1:777836-777950 chr2:1234567-1234789
>chr1:777836-777950
AGTGTTGGGATTACAGGTGTCAGCCACTGAGCCTGGCGGAGCACTTTATGTTATTAAGTA
GCCTAACCCAGGTGGGTCGCTGTCCCTCACGCCTGTAATCCCGACAACTCTGATG
>chr2:1234567-1234789
GAGGTGAGATGTCCAGCCTGCCTCATGAAGCTATGGCATAAACGTGCCTGGACTGCAGAC
GCCTTCCTTTTTATTGCAGGACACAGCCGTCTGCCCCTCGTGCGAGTCCGTGAGCCTCTG
GGGCTCCACGTGCATTCACTGCCTCAGGGGGCAAAGCTGATGATCTTTCTCAAGACCACA
GCATCGATAAAGGGTCCTTCATGGAGCCTGGGTCCACTGTCTC
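Because faidx regions are 1-based and inclusive, coordinates taken from a BED file (0-based start, exclusive end, as used by bedtools getfasta) need their start shifted by one before they can be used as region strings. The following is a minimal sketch of that conversion, not part of samtools itself. The file names regions.bed and regions.txt and the temporary directory are hypothetical, and the two BED records are made up so that they map onto the regions used above.

```shell
# Work in a throwaway directory (hypothetical location).
beddir=$(mktemp -d)

# Hypothetical BED file: tab-delimited, 0-based start, exclusive end.
printf 'chr1\t777835\t777950\nchr2\t1234566\t1234789\n' > "$beddir/regions.bed"

# Convert each BED record to a 1-based, inclusive region string:
# start + 1, end unchanged.
awk 'BEGIN { FS = "\t" } NF >= 3 { printf "%s:%d-%d\n", $1, $2 + 1, $3 }' \
    "$beddir/regions.bed" > "$beddir/regions.txt"

cat "$beddir/regions.txt"
```

Each printed line has the chr:start-end form that faidx accepts as a region argument.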
If you have a list of regions, you can save these regions into a text file, then use the -r option to get the sequences for these regions:
$ samtools faidx -r test.txt GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
>chr1:5842089-5842189
TGAGGGTCCATCTACCCCAGCAACTTCCGGCTTGGCCCGAGGAGTGTGAGTGGGTGTGGT
GTGAGCTGAAGGGTTCCTGTGCCTGGCAGTTTGACTTGCCT
>chr2:666777-666888
AGCAAGAGGGGCTTCAGTAAAGAGGGGAAGGCTGCACTGAAAACATGTTCTGGTTAACTT
GGTGAAAGAAGAAGGGTGCTCCACTGAAATCCAGACAAGGAAGCCTGAAGTT
Note: Remember to remove empty lines in the query region file (including the final empty line); otherwise, you may see error messages like:
[W::fai_get_val] Reference not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in
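One way to avoid this is to strip blank lines from the region file before passing it to -r. The sketch below uses standard grep rather than any samtools feature. The file name test.clean.txt and the temporary directory are hypothetical, and the sample file is built here only to show the effect.

```shell
# Work in a throwaway directory (hypothetical location).
regdir=$(mktemp -d)

# Hypothetical region file with a blank line in the middle and one at the end.
printf 'chr1:5842089-5842189\n\nchr2:666777-666888\n\n' > "$regdir/test.txt"

# Keep only lines that contain at least one non-whitespace character.
grep -v '^[[:space:]]*$' "$regdir/test.txt" > "$regdir/test.clean.txt"

cat "$regdir/test.clean.txt"
```

The cleaned file can then be given to the -r option in place of the original.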