Category

Sequence Analysis


Usage

faSize [options] file(s).fa


Manual

This tool is part of the UCSC Genome Browser utilities.

Required arguments

  • file(s).fa: Input FASTA file(s) to analyze. Separate multiple files with spaces.

Options

  • -detailed: Outputs the name and size of each record and suppresses the summary statistics.
  • -tab: Outputs the summary statistics in a tab-separated format.
  • -veryDetailed: Outputs the following values for each record / sequence:
    • name: Name of the sequence
    • size: Size of the sequence
    • Ns: Number of hard-masked bases
    • real: Number of non-hard-masked bases
    • upper: Number of uppercase bases
    • lower: Number of lowercase (soft-masked) bases
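
How these six columns relate to each other can be sketched in shell. The helper below, fa_counts, is a hypothetical name and is not part of faSize. It assumes that both N and n count as hard-masked and are excluded from the upper and lower counts; this is consistent with the example output on this page, where upper equals real, but the manual does not confirm it.

```shell
# Hypothetical helper (not part of faSize): read FASTA on stdin and print
# name, size, Ns, real, upper, lower for each record, tab-separated.
# Assumption: N and n are hard-masked and excluded from upper/lower.
fa_counts() {
  LC_ALL=C awk '
    function emit() {
      if (have) printf "%s\t%d\t%d\t%d\t%d\t%d\n", name, size, ns, size - ns, up, lo
    }
    /^>/ { emit(); have = 1; name = substr($1, 2); size = ns = up = lo = 0; next }
    {
      size += length($0)          # every character on a sequence line counts
      ns += gsub(/[Nn]/, "")      # count and remove hard-masked bases
      up += gsub(/[A-Z]/, "")     # remaining uppercase characters
      lo += gsub(/[a-z]/, "")     # remaining lowercase characters
    }
    END { emit() }
  '
}

# Example: one record with 4 uppercase, 2 N, and 4 lowercase bases.
printf '>seq1\nACGTNNacgt\n' | fa_counts
# prints: seq1	10	2	8	4	4
```

In this sketch, real is always size minus Ns, so the columns can be cross-checked against each other.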

Examples

Get summary statistics about the sequences in a FASTA file

By default, faSize reports the total number of bases, broken down into hard-masked bases (Ns), soft-masked bases (lowercase), and unmasked bases (uppercase). It also prints the mean, standard deviation, minimum, maximum, and median of the sequence sizes. The following example shows the summary statistics for the human reference genome (hg38):

$ faSize GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
3099922541 bases (165046090 N's 2934876451 real 2934876451 upper 0 lower) in 195 sequences in 1 files
Total size: mean 15897038.7 sd 46804464.6 min 970 (chrUn_KI270394v1) max 248956422 (chr1) median 32032
N count: mean 846390.2 sd 3850369.1
U count: mean 15050648.5 sd 45227268.4
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real

Print results in tab-separated format

With the -tab option, faSize prints the same statistics as tab-separated key/value pairs:

$ faSize -tab GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
baseCount	3099922541
nBaseCount	165046090
realBaseCount	2934876451
upperBaseCount	2934876451
lowerBaseCount	0
seqCount	195
fileCount	1
meanSize	15897038.7
SdSize	46804464.6
minSize	970
minSeqSize	chrUn_KI270394v1
maxSize	248956422
maxSeqSize	chr1
medianSize	32032
nCountMean	846390.2
nCountSd	3850369.1
upperCountMean	15050648.5
upperCountSd	45227268.4
lowerCountMean	0.0
lowerCountSd	0.0
fracMasked	0.00
fracRealMasked	0.00
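
Because each line of the -tab output is a key followed by a value, a single statistic can be picked out with awk. The following is a minimal sketch; fasize_value is a hypothetical helper name, not something faSize provides.

```shell
# Hypothetical helper: print the value for one key from `faSize -tab`
# output read on stdin. The key name is passed as the first argument.
fasize_value() {
  awk -F '\t' -v key="$1" '$1 == key { print $2 }'
}

# Example using three lines copied from the output above.
printf 'baseCount\t3099922541\nseqCount\t195\nmedianSize\t32032\n' | fasize_value seqCount
# prints: 195
```

With a real file, the same helper would be used as `faSize -tab genome.fa | fasize_value medianSize`, where genome.fa is a placeholder file name.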

Get per-sequence sizes

With the -detailed option, you can get size information for each sequence in the FASTA file:

$ faSize -detailed GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta | head -n 5
chr1	248956422
chr2	242193529
chr3	198295559
chr4	190214555
chr5	181538259
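
The two-column name/size output is easy to post-process with standard tools. As a rough sketch, the hypothetical helper sizes_at_least below keeps only sequences of a given minimum length and lists them largest first; the helper name and the threshold are illustrative assumptions.

```shell
# Hypothetical helper: from name<TAB>size lines on stdin, keep records with
# at least $1 bases and list them largest first.
sizes_at_least() {
  awk -F '\t' -v min="$1" '$2 + 0 >= min + 0' | sort -t "$(printf '\t')" -k2,2nr
}

# Example using sizes that appear elsewhere on this page.
printf 'chr3\t198295559\nchrUn_KI270394v1\t970\nchr1\t248956422\n' | sizes_at_least 1000000
# prints:
# chr1	248956422
# chr3	198295559
```

With a real file, this would be used as `faSize -detailed genome.fa | sizes_at_least 1000000`, where genome.fa is a placeholder file name.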

Get detailed statistics for each sequence

With the -veryDetailed option, faSize prints one line per sequence with six tab-separated columns: name, size, Ns, real, upper, and lower:

$ faSize -veryDetailed GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta | head -n 5
chr1	248956422	18475410	230481012	230481012	0
chr2	242193529	1645301	240548228	240548228	0
chr3	198295559	195424	198100135	198100135	0
chr4	190214555	461888	189752667	189752667	0
chr5	181538259	2555066	178983193	178983193	0
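
The columns can be combined to derive per-sequence ratios. The sketch below computes the percentage of hard-masked bases (Ns divided by size) for each line; percent_n is a hypothetical helper name, and the column order is taken from the option description above.

```shell
# Hypothetical helper: from -veryDetailed style lines on stdin
# (name, size, Ns, real, upper, lower), print the name and the
# percentage of bases that are N. Zero-length records are skipped.
percent_n() {
  awk -F '\t' '$2 > 0 { printf "%s\t%.2f\n", $1, 100 * $3 / $2 }'
}

# Example using the chr1 line from the output above.
printf 'chr1\t248956422\t18475410\t230481012\t230481012\t0\n' | percent_n
# prints: chr1	7.42
```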

File formats this tool works with
FASTA
