Category

Reads Manipulation


Usage

fastp [options] -i <file> -o <file>


Manual

Brief introduction

fastp is a tool used in bioinformatics for the quality control and preprocessing of raw sequence data. It is designed to handle data from high-throughput sequencing platforms, such as Illumina. fastp provides several key functions:

  • It can filter out low-quality reads, which are sequences that have a high probability of containing errors. This is done based on quality scores that are assigned to each base in a read.
  • It can trim adapter sequences, which are artificial sequences added during the preparation of sequencing libraries and are not part of the actual sample's genome.
  • It can correct mismatched bases in the overlapped regions of paired-end reads, using the higher-quality base of a pair to correct its low-quality counterpart.
  • It provides comprehensive quality control reports, including information on sequence quality, GC content, sequence length distribution, and more.

fastp is known for its speed and efficiency, and it can process data in parallel, making it suitable for large datasets.

Required arguments

fastp supports both single-end (SE) and paired-end (PE) input/output.

  • -i, --in1 string: read1 input file name.
  • -o, --out1 string: read1 output file name.

For PE data, you should also specify read2 input and output by:

  • -I, --in2 string: read2 input file name.
  • -O, --out2 string: read2 output file name.

If you don't specify the output file names, no output files will be written, but QC will still be performed on the data both before and after filtering. The output will be gzip-compressed if its file name ends with .gz.

Options

  • -V, --verbose: output verbose log information (i.e. when every 1M reads are processed).
  • -?, --help: print help message
I/O options
  • --unpaired1 string: for PE input, if read1 passed QC but read2 didn't, read1 will be written to --unpaired1. The default is to discard it.
  • --unpaired2 string: for PE input, if read2 passed QC but read1 didn't, read2 will be written to --unpaired2. If --unpaired2 is the same as --unpaired1 (the default mode), both unpaired reads will be written to this same file.
  • --failed_out string: specify the file to store reads that cannot pass the filters.
  • -m, --merge: for paired-end input, merge each pair of reads into a single read if they overlap. The merged reads will be written to the file given by --merged_out; the unmerged reads will be written to the files specified by --out1 and --out2. The merging mode is disabled by default.
  • --merged_out string: in the merging mode, specify the file name to store the merged output, or specify --stdout to stream the merged output.
  • --include_unmerged: in the merging mode, write the unmerged or unpaired reads to the file specified by --merged_out. Disabled by default.
  • -6, --phred64: indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)
  • -z, --compression int: compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4.
  • --stdin: input from STDIN. If the STDIN is interleaved paired-end FASTQ, please also add --interleaved_in.
  • --stdout: stream passing-filters reads to STDOUT. This option will result in interleaved FASTQ output for paired-end output. Disabled by default.
  • --interleaved_in: indicate that the input is an interleaved FASTQ file containing both read1 and read2. Disabled by default.
  • --reads_to_process int: If you don't want to process all the data, you can specify --reads_to_process to limit the reads to be processed. This is useful if you want to have a fast preview of the data quality, or you want to create a subset of the filtered data. Default 0 means process all reads.
  • --dont_overwrite: enable this option to protect existing files from being overwritten by fastp. In this case, fastp will report an error and quit if it finds that any of the output files (read1, read2, JSON report, HTML report) already exists. Overwriting is allowed by default.
  • --fix_mgi_id: the MGI FASTQ ID format is not compatible with many BAM operation tools, enable this option to fix it. (New in version 0.20.1)
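The phred64-to-phred33 conversion mentioned for -6/--phred64 above can be sketched in a few lines of Python. This is an illustrative snippet, not fastp's implementation; the helper name is invented:

```python
# Phred64 encodes quality Q as chr(Q + 64); phred33 uses chr(Q + 33).
# Converting therefore shifts each quality character down by 31.
def phred64_to_phred33(qual: str) -> str:
    return ''.join(chr(ord(c) - 31) for c in qual)

# e.g. 'h' (phred64 for Q40) becomes 'I' (phred33 for Q40)
```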
Adapter trimming options

fastp first trims the auto-detected adapter or the adapter sequences given by --adapter_sequence or --adapter_sequence_r2, then trims the adapters given by --adapter_fasta one by one. The sequence distribution of trimmed adapters can be found in the HTML/JSON reports.

  • -A, --disable_adapter_trimming: adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
  • -a, --adapter_sequence string: the adapter for read1. For SE data, if not specified, the adapter is evaluated by analyzing the tails of the first ~1M reads (this evaluation may be inaccurate). For PE data, the adapters can be detected by per-read overlap analysis, which seeks the overlap of each pair of reads. This method is robust and fast, so normally you don't have to input the adapter sequence even if you know it. But you can still specify the adapter sequence for read1 by --adapter_sequence, and for read2 by --adapter_sequence_r2. If fastp fails to find an overlap (e.g. due to low quality bases), it will use these sequences to trim adapters for read1 and read2 respectively.
  • --adapter_sequence_r2 string: the adapter for read2 (PE data only). This is used if R1/R2 are found not to overlap. If not specified, it defaults to the same sequence as --adapter_sequence. The most widely used adapters are the Illumina TruSeq adapters. If your data is from a TruSeq library, you can add --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA and --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT to your command lines, or enable auto-detection for PE data by specifying --detect_adapter_for_pe.
  • --adapter_fasta string: specify a FASTA file to trim both read1 and read2 (if PE) by all the sequences in this FASTA file. Each adapter sequence in this file should be at least 6 bp long, otherwise it will be skipped. You can supply any sequences you want to trim, not just regular sequencing adapters (e.g. polyA).
  • --detect_adapter_for_pe: by default, adapter auto-detection is used for SE data input only; turn on this option to enable it for PE data. For PE data, fastp will run a little slower if you specify the adapter sequences or enable adapter auto-detection, but this usually results in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
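Trimming by a given adapter sequence (the fallback used when overlap analysis fails) can be sketched as follows. This is a simplified, hypothetical helper, not fastp's code: it requires an exact match of at least 4 bases between the read tail and the adapter prefix, whereas fastp tolerates mismatches:

```python
def trim_adapter(read: str, adapter: str, min_match: int = 4) -> str:
    # Scan for the first position where the read suffix exactly matches
    # a prefix of the adapter; cut the read there.
    for i in range(len(read)):
        n = min(len(read) - i, len(adapter))
        if n >= min_match and read[i:i + n] == adapter[:n]:
            return read[:i]
    return read
```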
Global trimming options

fastp supports global trimming, which means trim all reads in the front or the tail. This function is useful since sometimes you want to drop some cycles of a sequencing run.

  • -f, --trim_front1 int: trimming how many bases in front for read1, default is 0.
  • -t, --trim_tail1 int: trimming how many bases in tail for read1, default is 0.
  • -b, --max_len1 int: if read1 is longer than --max_len1, then trim read1 at its tail to make it as long as --max_len1. Default 0 means no limitation.
  • -F, --trim_front2 int: trimming how many bases in front for read2. If it's not specified, it will follow read1's settings, default is 0.
  • -T, --trim_tail2 int: trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings, default is 0.
  • -B, --max_len2 int: if read2 is longer than --max_len2, then trim read2 at its tail to make it as long as --max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings.
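The semantics of the global trimming options above can be summarized with a small sketch (an assumed, simplified model of -f/-t/-b, not fastp's implementation):

```python
def global_trim(read: str, trim_front: int = 0, trim_tail: int = 0,
                max_len: int = 0) -> str:
    # -f/--trim_front1: drop bases from the 5' end
    # -t/--trim_tail1: drop bases from the 3' end
    end = len(read) - trim_tail
    read = read[trim_front:end]
    # -b/--max_len1: then truncate the read at its tail to at most max_len
    if max_len and len(read) > max_len:
        read = read[:max_len]
    return read
```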
Duplication evaluation and deduplication

For both SE and PE data, fastp supports evaluating the duplication rate and removing duplicated reads/pairs. fastp considers a read duplicated only if all of its bases are identical to those of another read. This means that if there is a sequencing error or an N base, the read will not be treated as a duplicate.

  • -D, --dedup: enable deduplication to drop the duplicated reads/pairs. 
  • --dup_calc_accuracy int: accuracy level for calculating duplication (1~6); a higher level uses more memory (1G, 2G, 4G, 8G, 16G, 24G). Default is 1 for no-dedup mode, and 3 for dedup mode.
  • --dont_eval_duplication: don't evaluate the duplication rate, to save time and memory. By default, fastp evaluates the duplication rate; this module may use 1G of memory and take 10% ~ 20% more running time. If you don't need the duplication rate information, you can set --dont_eval_duplication to disable the evaluation. But please note that if the deduplication (--dedup) option is enabled, --dont_eval_duplication is ignored.
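The exact-match semantics of deduplication can be sketched as follows (a minimal illustration that keeps the first occurrence of each sequence; fastp itself uses a hashed, memory-bounded variant of this idea, described under "More on deduplication" in the Notes):

```python
def dedup(reads: list[str]) -> list[str]:
    # Keep the first occurrence of each exact sequence; a single
    # sequencing error or N base makes a read count as distinct.
    seen, kept = set(), []
    for r in reads:
        if r not in seen:
            seen.add(r)
            kept.append(r)
    return kept
```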
polyG tail trimming

For Illumina NextSeq/NovaSeq data, polyG can happen in read tails since G means no signal in the Illumina two-color systems. fastp can detect the polyG in read tails and trim them.

  • -g, --trim_poly_g: force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data. NextSeq/NovaSeq data is detected by the machine ID in the FASTQ records.
  • --poly_g_min_len int: the minimum length to detect polyG in the read tail. 10 by default.
  • -G, --disable_trim_poly_g: disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
polyX tail trimming
  • -x, --trim_poly_x: enable polyX trimming in 3' ends. This setting is useful for trimming the tails having polyX (i.e. polyA) before polyG. polyG is usually caused by sequencing artifacts, while polyA can be commonly found from the tails of mRNA-Seq reads.
  • --poly_x_min_len int: the minimum length to detect polyX in the read tail. 10 by default.
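The polyG/polyX tail-trimming logic above can be sketched as follows. This is a simplified, hypothetical helper, not fastp's code: it only trims an exact run of the given base, whereas fastp tolerates a few interrupting mismatches:

```python
def trim_poly_x(read: str, base: str = 'G', min_len: int = 10) -> str:
    # Count the run of `base` at the 3' end; trim it only if the run
    # reaches the minimum detection length (--poly_g_min_len / --poly_x_min_len).
    n = 0
    while n < len(read) and read[len(read) - 1 - n] == base:
        n += 1
    return read[:len(read) - n] if n >= min_len else read
```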
Per read cutting by quality options

fastp supports per-read sliding window cutting by evaluating the mean quality scores within the window. Be aware that these operations may interfere with deduplication for SE data, and --cut_front or --cut_right may also interfere with deduplication for PE data, since the deduplication algorithms rely on exact matching of the corresponding regions of the grouped reads/pairs. If you don't set the window size and mean quality threshold for these functions individually, fastp will use the values from -W and -M.

  • -5, --cut_front: move a sliding window from front (5') to tail; drop the bases in the window if its mean quality is below the threshold, otherwise stop. Please note that --cut_front will interfere with deduplication for both PE and SE data.
  • -3, --cut_tail: move a sliding window from tail (3') to front; drop the bases in the window if its mean quality is below the threshold, otherwise stop. --cut_tail will interfere with deduplication for SE data, since the deduplication algorithms rely on exact matching of the corresponding regions of the grouped reads/pairs.
  • -r, --cut_right: move a sliding window from front to tail; on meeting a window with mean quality below the threshold, drop the bases in the window and everything to its right, then stop. If --cut_right is enabled, there is no need to enable --cut_tail, since the former is more aggressive. If --cut_right is enabled together with --cut_front, --cut_front will be performed first to avoid dropping whole reads due to low quality starting bases.
  • -W, --cut_window_size int: the window size option shared by --cut_front, --cut_tail or --cut_sliding. Range: 1~1000, default: 4.
  • -M, --cut_mean_quality int: the mean quality requirement option shared by --cut_front, --cut_tail or --cut_sliding. Range: 1~36, default: 20 (Q20).
  • --cut_front_window_size int: the window size option of --cut_front, default to --cut_window_size if not specified, default: 4.
  • --cut_front_mean_quality int: the mean quality requirement option for --cut_front, default to --cut_mean_quality if not specified, default: 20.
  • --cut_tail_window_size int: the window size option of --cut_tail, default to --cut_window_size if not specified, default: 4.
  • --cut_tail_mean_quality int: the mean quality requirement option for --cut_tail, default to --cut_mean_quality if not specified, default: 20.
  • --cut_right_window_size int: the window size option of --cut_right, default to --cut_window_size if not specified, default: 4.
  • --cut_right_mean_quality int: the mean quality requirement option for --cut_right, default to --cut_mean_quality if not specified, default: 20.
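The --cut_right behavior described above can be sketched as follows (an assumed, simplified model, not fastp's implementation; qualities are plain integer scores):

```python
def cut_right(seq: str, qual: list[int], window: int = 4,
              mean_q: int = 20) -> tuple[str, list[int]]:
    # Slide a window from 5' to 3'; at the first window whose mean quality
    # drops below the threshold, drop that window and everything after it.
    for i in range(len(seq) - window + 1):
        if sum(qual[i:i + window]) / window < mean_q:
            return seq[:i], qual[:i]
    return seq, qual
```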
Quality filtering options
  • -Q, --disable_quality_filtering: quality filtering is enabled by default. If this option is specified, quality filtering is disabled
  • -q, --qualified_quality_phred int: the quality value at which a base is qualified. Default 15 means phred quality >= Q15 is qualified.
  • -u, --unqualified_percent_limit int: what percentage of bases are allowed to be unqualified (0~100). Default 40 means 40%.
  • -n, --n_base_limit int: if one read's number of N bases is greater than n_base_limit, this read/pair is discarded. Default is 5.
  • -e, --average_qual int: if one read's average quality score is below avg_qual, this read/pair is discarded. Default 0 means no requirement.
Length filtering options
  • -L, --disable_length_filtering: length filtering is enabled by default. If this option is specified, length filtering is disabled
  • -l, --length_required int: reads shorter than length_required will be discarded, default is 15.
  • --length_limit int: reads longer than length_limit will be discarded, default 0 means no limitation.
Low complexity filtering

The complexity is defined as the percentage of bases that differ from the next base (base[i] != base[i+1]). For example:

# a 51-bp sequence with 3 positions where a base differs from the next base
seq = 'AAAATTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGCCCC'
# complexity = 3 / (51 - 1) = 6%
complexity = sum(seq[i] != seq[i + 1] for i in range(len(seq) - 1)) / (len(seq) - 1)
  • -y, --low_complexity_filter: enable the low complexity filter. The complexity is defined as the percentage of bases that differ from the next base (base[i] != base[i+1]).
  • -Y, --complexity_threshold int: the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required.
Filter reads with unwanted indexes
  • --filter_by_index1 string: specify a file that contains a list of barcodes of index1 to be filtered out, one barcode per line.
  • --filter_by_index2 string: specify a file that contains a list of barcodes of index2 to be filtered out, one barcode per line.
  • --filter_by_index_threshold int: the allowed difference of index barcode for index filtering, default 0 means completely identical.
Base correction by overlap analysis options
  • -c, --correction: enable base correction in overlapped regions (only for PE data); disabled by default. fastp performs overlap analysis for PE data, trying to find an overlap for each pair of reads. When this option is enabled and a proper overlap is found, fastp can correct mismatched base pairs in the overlapped region, if one base has high quality while the other has very low quality. If a base is corrected, the quality of its paired base will be assigned to it so that they share the same quality.
  • --overlap_len_require int: the minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default.
  • --overlap_diff_limit int: the maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default.
  • --overlap_diff_percent_limit int: the maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%.
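The overlap analysis behind merging, adapter trimming and correction can be sketched as follows. This is a deliberately simplified illustration (invented helper names, not fastp's algorithm): it only scans offsets where read1's tail aligns with the head of read2's reverse complement, and ignores the percent limit:

```python
def rc(seq: str) -> str:
    # Reverse complement of a DNA sequence.
    comp = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}
    return ''.join(comp[b] for b in reversed(seq))

def find_overlap(r1: str, r2: str, min_len: int = 30, diff_limit: int = 5):
    # Try each offset in r1 where rc(r2) could start; accept the first
    # alignment that is long enough and has few enough mismatches.
    r2rc = rc(r2)
    for off in range(len(r1) - min_len + 1):
        n = min(len(r1) - off, len(r2rc))
        diffs = sum(a != b for a, b in zip(r1[off:off + n], r2rc[:n]))
        if diffs <= diff_limit:
            return off, n   # overlap start in r1, overlap length
    return None
```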
UMI processing

fastp can extract unique molecular identifiers (UMIs) and append them to the first part of the read names, so the UMIs will also be present in SAM/BAM records. If the UMI is in the reads, it will be removed from the read, so the read will become shorter. If the UMI is in the index, it will be kept.

  • -U, --umi: enable UMI preprocessing
  • --umi_loc string: specify the location of the UMI; can be one of index1/index2/read1/read2/per_index/per_read. Default is none.
    • index1: the first index is used as UMI. If the data is PE, this UMI will be used for both read1/read2.
    • index2: the second index is used as UMI. PE data only, this UMI will be used for both read1/read2.
    • read1: the head of read1 is used as UMI. If the data is PE, this UMI will be used for both read1/read2.
    • read2: the head of read2 is used as UMI. PE data only, this UMI will be used for both read1/read2.
    • per_index: index1_index2 is used as the UMI for both read1/read2.
    • per_read: define umi1 as the head of read1, and umi2 as the head of read2. umi1_umi2 is used as UMI for both read1/read2.
  • --umi_len int: if the UMI is in read1/read2, its length should be provided, default: 0.
  • --umi_prefix string: if a prefix is specified, an underscore will be used to connect it and the UMI. For example, if the UMI is AATTCCGG and you specified --umi_prefix=UMI, then the final string presented in the name will be UMI_AATTCCGG. No prefix by default.
  • --umi_skip int: if the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0.
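For --umi_loc=read1, the transformation fastp applies can be sketched as follows (a hypothetical helper, not fastp's code; the FASTQ name is treated as "first-part rest"):

```python
def move_umi_to_name(name: str, seq: str, qual: str, umi_len: int = 8,
                     skip: int = 0, prefix: str = ''):
    # Cut the UMI off the read head and append it to the first part of
    # the read name with a ':'; --umi_skip drops extra bases after the UMI.
    umi = seq[:umi_len]
    tag = f'{prefix}_{umi}' if prefix else umi
    first, _, rest = name.partition(' ')
    new_name = f'{first}:{tag}' + (f' {rest}' if rest else '')
    start = umi_len + skip
    return new_name, seq[start:], qual[start:]
```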
Overrepresented sequence analysis
  • -p, --overrepresentation_analysis: enable overrepresented sequence analysis. For consideration of speed and memory, fastp only counts sequences with length of 10bp, 20bp, 40bp, 100bp or (cycles - 2 ).
  • -P, --overrepresentation_sampling int: one in --overrepresentation_sampling reads will be sampled for overrepresentation analysis (1~10000); smaller is slower. Default is 20. When overrepresented sequence analysis is enabled, fastp uses 1/20 of the reads for sequence counting, and you can change this setting with the --overrepresentation_sampling option. For example, if you set -P 100, only 1/100 of the reads will be used for counting, and if you set -P 1, all reads will be used, but it will be extremely slow. The default value of 20 is a balance between speed and accuracy.
Reporting options
  • -j, --json string: the json format report file name, default is fastp.json.
  • -h, --html string: the html format report file name, default is fastp.html.
  • -R, --report_title string: should be quoted with ' or ", default is "fastp report".
Threading options
  • -w, --thread int: worker thread number, default is 2.
Output splitting options

For parallel processing of FASTQ files (i.e. alignment in parallel), fastp supports splitting the output into multiple files. The splitting can work with two different modes: by limiting file number or by limiting lines of each file. These two modes cannot be enabled together.  

  • -s, --split int: split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default.
  • -S, --split_by_lines int: split the output by limiting the lines of each file with this option (>=1000); a sequential number prefix will be added to the output name (0001.out.fq, 0002.out.fq...). Disabled by default.
  • -d, --split_prefix_digits int: the number of digits for sequential number padding (1~10). Default is 4, so file names will be padded like 0001.xxx; 0 disables padding. The split files' names will have a sequential number prefix added to the original file name specified by --out1 or --out2, and the width of the prefix is controlled by the -d option. For example, with --split_prefix_digits=4, --out1=out.fq and --split=3, the output files will be 0001.out.fq, 0002.out.fq and 0003.out.fq.
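The split-file naming scheme can be sketched in one line of Python (an illustrative helper, not fastp's code):

```python
def split_name(out_name: str, index: int, digits: int = 4) -> str:
    # -d/--split_prefix_digits: zero-pad the sequential prefix to `digits`
    # characters; digits == 0 disables padding.
    return f'{index:0{digits}d}.{out_name}' if digits else f'{index}.{out_name}'
```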
Deprecated options
  • --cut_by_quality5: DEPRECATED, use --cut_front instead.
  • --cut_by_quality3: DEPRECATED, use --cut_tail instead.
  • --cut_by_quality_aggressive: DEPRECATED, use --cut_right instead.
  • --discard_unmerged: DEPRECATED, no effect now, see the introduction for merging.

Examples

Streaming input/output

fastp supports streaming the passing-filter reads to STDOUT (by specifying --stdout), so that it can be passed to other compressors like bzip2, or be passed to aligners like bwa and bowtie2. For PE data, the output will be interleaved FASTQ, which means the output will contain records like record1-R1 -> record1-R2 -> record2-R1 -> record2-R2 -> record3-R1 -> record3-R2 ...

fastp -i input_R1.fastq -I input_R2.fastq --stdout | bowtie2 -x reference_genome -1 - -2 - --interleaved -S alignment.sam

If you want fastp to read from STDIN for processing, specify --stdin. If the STDIN is an interleaved paired-end stream, also specify --interleaved_in to indicate that.

Unique molecular identifier (UMI) processing

UMIs are useful for duplicate elimination and error correction based on generating a consensus of the reads originating from the same DNA fragment. They are usually used in deep sequencing applications like ctDNA sequencing.

In the following example, the UMIs are located at the beginning of the reads (first 8 bp):

@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA

After processing with the following command, the UMIs are moved into the read IDs:

fastp -i R1.fq -o out.R1.fq -U --umi_loc=read1 --umi_len=8
@NS500713:64:HFKJJBGXY:1:11101:1675:1101:AAAAAAAA 1:N:0:TATAGCCT+GACCCCCA
GCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
EEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA
Merge paired-end reads

For paired-end (PE) input, fastp supports stitching the pairs together by specifying the -m/--merge option. In this merging mode:

  • --merged_out should be given to specify the file to store merged reads, otherwise you should enable --stdout to stream the merged reads to STDOUT. The merged reads are also filtered.
  • --out1 and --out2 will be the reads that cannot be merged successfully, but both pass all the filters.
  • --unpaired1 will be the reads that cannot be merged, read1 passes filters but read2 doesn't.
  • --unpaired2 will be the reads that cannot be merged, read2 passes filters but read1 doesn't.
  • --include_unmerged can be enabled to redirect the reads of --out1, --out2, --unpaired1 and --unpaired2 to --merged_out, so you will get a single output file. This option is disabled by default.

--failed_out can still be given to store the reads (either merged or unmerged) that fail to pass the filters.

In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many base pairs are from read1 and from read2, respectively. For example, @NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15 means that 150 bp are from read1 and 15 bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than those in read2.
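The construction of the merged read and its name tag can be sketched as follows (a simplified, hypothetical helper, not fastp's code; r2rc is read2's reverse complement and overlap_off is the position in read1 where the overlap starts, as found by the overlap analysis):

```python
def merge_pair(r1: str, r2rc: str, overlap_off: int):
    # Read1 bases are preferred in the overlap; the non-overlapping tail
    # of read2 is appended, and a merged_xxx_yyy tag records how many
    # bases came from each read.
    merged = r1 + r2rc[len(r1) - overlap_off:]
    tag = f'merged_{len(r1)}_{len(merged) - len(r1)}'
    return merged, tag
```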

Notes

fastp's processing steps
  1. UMI preprocessing (--umi)
  2. global trimming at front (--trim_front)
  3. global trimming at tail (--trim_tail)
  4. quality pruning at 5' (--cut_front)
  5. quality pruning by sliding window (--cut_right)
  6. quality pruning at 3' (--cut_tail)
  7. trim polyG (--trim_poly_g, enabled by default for NovaSeq/NextSeq data)
  8. trim adapter by overlap analysis (enabled by default for PE data)
  9. trim adapter by adapter sequence (--adapter_sequence, --adapter_sequence_r2. For PE data, this step is skipped if last step succeeded)
  10. trim polyX (--trim_poly_x)
  11. trim to max length (--max_len)
More on deduplication

fastp uses a hash algorithm to find identical sequences. Due to possible hash collisions, about 0.01% of the total reads may be wrongly recognized as duplicates. Normally this does not impact downstream analysis. The accuracy of the duplication calculation can be improved by increasing the number of hash buffers or enlarging the buffer size. The --dup_calc_accuracy option specifies the level; a higher level means more memory usage and a longer running time. Please refer to the following table:

dup_calc_accuracy level  hash buffer number  buffer size  memory usage  speed       note
1                        1                   1G           1G            ultra-fast  default for no-dedup mode
2                        1                   2G           2G            fast
3                        2                   2G           4G            fast        default for dedup mode
4                        2                   4G           8G            fast
5                        2                   8G           12G          fast
6                        3                   8G           24G          moderate

 

Protocols using this tool

PROcap preprocessing (with two replicates)
File formats this tool works with
FASTQ
