Category

Mapping


Usage

STAR --genomeDir /path/to/genomeDir --readFilesIn /path/to/read1 [/path/to/read2] --runThreadN NumberOfThreads --option1-name option1-value(s) ...


Manual

This document is generated with STAR 2.7.1a.

Spliced Transcripts Alignment to a Reference (STAR) is a popular tool for aligning RNA-seq reads to a reference genome. It’s known for its speed and accuracy.

Required arguments

  • --genomeDir string: path to the directory where genome files are stored (default: ./GenomeDir/)
  • --readFilesIn string Read1 Read2(s): paths to files that contain input read1 (and, if needed, read2)

Options

Run Parameters
  • --runThreadN int: number of threads to run STAR (default: 1)
  • --runDirPerm string: permissions for the directories created at the run-time.
    • User_RWX: user-read/write/execute (default)
    • All_RWX: all-read/write/execute (same as chmod 777)
  • --runRNGseed int: random number generator seed. (default: 777)
Genome Parameters
  • --genomeLoad string: mode of shared memory usage for the genome files.
    • LoadAndKeep: load genome into shared memory and keep it in memory after run
    • LoadAndRemove: load genome into shared memory but remove it after run
    • LoadAndExit: load genome into shared memory and exit, keeping the genome in memory for future runs
    • Remove: do not map anything, just remove loaded genome from memory
    • NoSharedMemory: do not use shared memory, each job will have its own private copy of the genome (default)
  • --genomeFastaFiles string(s): path(s) to the fasta files with the genome sequences, separated by spaces. These files must be plain text FASTA files; they cannot be zipped. Use this option if you want to add extra (new) sequences to the genome (e.g. spike-ins). (default: -)
  • --genomeFileSizes uint(s)>0: genome files exact sizes in bytes. Typically, this should not be defined by the user. (default: 0)
  • --genomeConsensusFile string: VCF file with consensus SNPs (i.e. alternative allele is the major (AF>0.5) allele) (default: -)
Splice Junctions Database
  • --sjdbFileChrStartEnd string(s): path to the files with genomic coordinates (chr start end strand) for the splice junction introns. Multiple files can be supplied and will be concatenated. (default: -)
  • --sjdbGTFfile string: path to the GTF file with annotations. (default: -)
  • --sjdbGTFchrPrefix string: prefix for chromosome names in a GTF file (e.g. 'chr' for using ENSEMBL annotations with UCSC genomes). (default: -)
  • --sjdbGTFfeatureExon string: feature type in GTF file to be used as exons for building transcripts. (default: exon)
  • --sjdbGTFtagExonParentTranscript string: GTF attribute name for parent transcript ID (default "transcript_id" works for GTF files) (default: transcript_id)
  • --sjdbGTFtagExonParentGene string: GTF attribute name for parent gene ID (default "gene_id" works for GTF files) (default: gene_id)
  • --sjdbGTFtagExonParentGeneName string(s): GTF attribute name for parent gene name. (default: gene_name)
  • --sjdbGTFtagExonParentGeneType string(s): GTF attribute name for parent gene type. (default: gene_type gene_biotype)
  • --sjdbOverhang int>0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1). (default: 100)
  • --sjdbScore int: extra alignment score for alignments that cross database junctions. (default: 2)
  • --sjdbInsertSave string: which files to save when sjdb junctions are inserted on the fly at the mapping step
    • Basic: only small junction / transcript files (default)
    • All: all files including big Genome, SA and SAindex - this will create a complete genome directory
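
As a small illustration of the --sjdbOverhang guidance above (ideally mate_length - 1), the value can be derived from the read length. The variable names and the 100-base mate length are assumptions made for this sketch, not values taken from STAR.

```shell
# Hypothetical example: derive --sjdbOverhang from the mate length.
# mate_length is an assumed value; substitute the length of your own reads.
mate_length=100
sjdb_overhang=$((mate_length - 1))
echo "--sjdbOverhang $sjdb_overhang"
```

For reads of varying length, the longest mate length is the natural input here, since the overhang only needs to cover the longest read.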
2-pass Mapping

In the 2-pass mode, STAR performs the alignment in two steps:

  1. First pass: STAR aligns the RNA-seq reads to the genome and generates alignment files and splice junctions.
  2. Second pass: The genome indices are re-generated using the splice junctions obtained from the first pass. Then, STAR re-maps all the reads.

The 2-pass mode is highly recommended if your goal is to robustly and accurately identify novel splice junctions for differential splicing analysis and variant discovery.

  • --twopassMode string: 2-pass mapping mode.
    • None: 1-pass mapping (default)
    • Basic: basic 2-pass mapping, with all 1st pass junctions inserted into the genome indices on the fly
  • --twopass1readsN int: number of reads to process for the 1st step. Use very large number (or default -1) to map all reads in the first step. (default: -1)
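
The two options above can be combined with the required arguments into a single command. The sketch below only assembles and prints such a command rather than running it; the index path and read file names are placeholders, not real files.

```shell
# Sketch only: build and print (do not run) a basic 2-pass STAR command.
# genome_dir and the read file names are hypothetical placeholders.
genome_dir=/path/to/genomeDir
cmd="STAR --genomeDir $genome_dir --readFilesIn read1.fq read2.fq"
cmd="$cmd --twopassMode Basic --twopass1readsN -1"
echo "$cmd"
```

Printing the command first makes it easy to inspect the options before committing to a long alignment run.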
Variation parameters
  • --varVCFfile string: path to the VCF file that contains variation data. Note: The VCF file cannot be in a compressed format (like gzip). If you have variants in multiple files, e.g. one vcf for one chromosome, you can concatenate them together and pass them to STAR with a command like --varVCFfile <(cat chr*.vcf)
  • --waspOutputMode string: WASP allele-specific output type. Requires --varVCFfile to be specified. This is a re-implementation of the original WASP mappability filtering by Bryce van de Geijn, Graham McVicker, Yoav Gilad & Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061–1063 (2015). WASP filtering is activated with --waspOutputMode SAMtag, which will add vW tag to the reads that overlap variants. Tag values:
    • vW:i:1 means alignment passed WASP filtering, and all other values mean it did not pass:
      • vW:i:2 - multi-mapping read
      • vW:i:3 - variant base in the read is N (non-ACGT)
      • vW:i:4 - remapped read did not map
      • vW:i:5 - remapped read multi-maps
      • vW:i:6 - remapped read maps to a different locus
      • vW:i:7 - read overlaps too many variants
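
One possible way to act on the vW tag values listed above is to filter a SAM stream after alignment. The helper below is a sketch, not part of STAR: the function name wasp_keep is made up, and it assumes that alignments without a vW tag (reads that do not overlap a variant) should be kept.

```shell
# Hypothetical helper: keep SAM header lines, alignments tagged vW:i:1,
# and alignments that carry no vW tag at all; drop every other vW value.
wasp_keep() {
    awk -F '\t' '
        /^@/ { print; next }              # header lines pass through unchanged
        {
            tag = ""
            for (i = 12; i <= NF; i++)    # optional SAM fields start at column 12
                if ($i ~ /^vW:i:/) tag = $i
            if (tag == "" || tag == "vW:i:1") print
        }'
}

# Tiny made-up input: r1 passed, r2 failed (multi-mapping), r3 has no vW tag.
printf 'r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tvW:i:1\nr2\t0\tchr1\t200\t255\t4M\t*\t0\t0\tACGT\tIIII\tvW:i:2\nr3\t0\tchr1\t300\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\n' | wasp_keep
```

With a real BAM file, the same function could be fed from samtools view -h; whether untagged reads should be kept depends on the downstream analysis.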
Read Parameters
  • --readFilesType string: format of input read files
    • Fastx: FASTA or FASTQ (default)
    • SAM SE: SAM or BAM single-end reads; for BAM use --readFilesCommand samtools view
    • SAM PE: SAM or BAM paired-end reads; for BAM use --readFilesCommand samtools view
  • --readFilesPrefix string: prefix for the read file names, i.e. it will be added in front of the strings in --readFilesIn. (default: -, no prefix)
  • --readFilesCommand string(s): command line to execute for each of the input files. This command should generate FASTA or FASTQ text and send it to stdout. For example: zcat to uncompress .gz files, bzcat to uncompress .bz2 files, etc. (default: -)
  • --readMapNumber int: number of reads to map from the beginning of the file (default: -1: map all reads)
  • --readMatesLengthsIn string: Equal/NotEqual - lengths of names,sequences,qualities for both mates are the same / not the same. NotEqual is safe in all situations. (default: NotEqual)
  • --readNameSeparator string(s): character(s) separating the part of the read names that will be trimmed in output (read name after space is always trimmed) (default: /)
  • --clip3pNbases int(s): number(s) of bases to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates. (default: 0)
  • --clip5pNbases int(s): number(s) of bases to clip from 5p of each mate. If one value is given, it will be assumed the same for both mates. (default: 0)
  • --clip3pAdapterSeq string(s): adapter sequences to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates. (default: -)
  • --clip3pAdapterMMp double(s): max proportion of mismatches for 3p adapter clipping for each mate. If one value is given, it will be assumed the same for both mates. (default: 0.1)
  • --clip3pAfterAdapterNbases int(s): number of bases to clip from 3p of each mate after the adapter clipping. If one value is given, it will be assumed the same for both mates. (default: 0)
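
The --readFilesCommand description above maps file extensions to decompression commands. A wrapper script might pick the value automatically, as in this sketch; the function name read_files_command is hypothetical, and only the two extensions mentioned above are handled.

```shell
# Hypothetical helper: choose a --readFilesCommand value from the file extension.
# Falls back to "-" (the documented default) for uncompressed input.
read_files_command() {
    case "$1" in
        *.gz)  echo "zcat" ;;
        *.bz2) echo "bzcat" ;;
        *)     echo "-" ;;
    esac
}

read_files_command reads.fq.gz
```

The returned string would then be passed as the value of --readFilesCommand when assembling the STAR command line.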
Limits
  • --limitGenomeGenerateRAM int>0: maximum available RAM (bytes) for genome generation (default: 31000000000)
  • --limitIObufferSize int>0: max available buffers size (bytes) for input/output, per thread (default: 150000000)
  • --limitOutSAMoneReadBytes int>0: max size of the SAM record (bytes) for one read. Recommended value: $>2\times(\text{LengthMate1}+\text{LengthMate2}+100)\times\text{outFilterMultimapNmax}$ (default: 100000)
  • --limitOutSJoneRead int>0: max number of junctions for one read (including all multi-mappers) (default: 1000)
  • --limitOutSJcollapsed int>0: max number of collapsed junctions (default: 1000000)
  • --limitBAMsortRAM int>=0: maximum available RAM (bytes) for sorting BAM. If =0, it will be set to the genome index size. 0 value can only be used with --genomeLoad NoSharedMemory option. (default: 0)
  • --limitSjdbInsertNsj int>=0: maximum number of junctions to be inserted to the genome on the fly at the mapping stage, including those from annotations and those detected in the 1st step of the 2-pass run. (default: 1000000)
  • --limitNreadsSoft int: soft limit on the number of reads. (default: -1)
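
The recommended lower bound for --limitOutSAMoneReadBytes given above is plain integer arithmetic, so it can be checked quickly in the shell. The function name and the example mate lengths below are assumptions made for illustration.

```shell
# Hypothetical helper: lower bound from the recommendation above,
# 2 * (LengthMate1 + LengthMate2 + 100) * outFilterMultimapNmax.
# Arguments: <LengthMate1> <LengthMate2> <outFilterMultimapNmax>
min_sam_one_read_bytes() {
    echo $((2 * ($1 + $2 + 100) * $3))
}

# 2x100 reads with the default --outFilterMultimapNmax of 10
min_sam_one_read_bytes 100 100 10   # prints 6000
```

For these inputs the bound is far below the default of 100000, so the limit only needs raising for long reads or very permissive multimapping settings.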
Output
General options
  • --outFileNamePrefix string: output files name prefix (including full or relative path). Can only be defined on the command line. (default: ./)
  • --outTmpDir string: path to a directory that will be used as temporary by STAR. All contents of this directory will be removed! If not set (-), the temp directory defaults to outFileNamePrefix_STARtmp. (default: -)
  • --outTmpKeep string: whether to keep the temporary files after the STAR run is finished
    • None: remove all temporary files (default)
    • All: keep all files
  • --outStd string: which output will be directed to stdout (standard out)
    • Log: log messages (default)
    • SAM: alignments in SAM format (which normally are output to Aligned.out.sam file), normal standard output will go into Log.std.out
    • BAM_Unsorted: alignments in BAM format, unsorted. Requires --outSAMtype BAM Unsorted
    • BAM_SortedByCoordinate: alignments in BAM format, sorted by coordinate. Requires --outSAMtype BAM SortedByCoordinate
    • BAM_Quant: alignments to transcriptome in BAM format, unsorted. Requires --quantMode TranscriptomeSAM
  • --outReadsUnmapped string: output of unmapped and partially mapped (i.e. mapped only one mate of a paired end read) reads in separate file(s).
    • None: no output (default)
    • Fastx: output in separate fasta/fastq files, Unmapped.out.mate1/2
  • --outQSconversionAdd int: add this number to the quality score (e.g. to convert from Illumina to Sanger, use -31) (default: 0)
  • --outMultimapperOrder string: order of multimapping alignments in the output files
    • Old_2.4: quasi-random order used before 2.5.0 (default)
    • Random: random order of alignments for each multi-mapper. Read mates (pairs) are always adjacent, and all alignments for each read stay together. This option will become the default in future releases.
SAM and BAM
  • --outSAMtype string(s): type of SAM/BAM output.
    • 1st word:
      • BAM: output BAM without sorting
      • SAM: output SAM without sorting (default)
      • None: no SAM/BAM output
    • 2nd, 3rd words:
      • Unsorted: standard unsorted
      • SortedByCoordinate: sorted by coordinate. This option will allocate extra memory for sorting which can be specified by --limitBAMsortRAM.
  • --outSAMmode string: mode of SAM output.
    • None: no SAM output
    • Full: full SAM output (default)
    • NoQS: full SAM but without quality scores
  • --outSAMstrandField string: Cufflinks-like strand field flag
    • None: not used (default)
    • intronMotif: strand derived from the intron motif. Reads with inconsistent and/or non-canonical introns are filtered out.
  • --outSAMattributes string: a string of desired SAM attributes, in the order desired for the output SAM.
    • NH HI AS nM NM MD jM jI XS MC ch: any combination in any order
    • None: no attributes
    • Standard: NH HI AS nM (default)
    • All: NH HI AS nM NM MD jM jI MC ch
    • vA: variant allele
    • vG: genomic coordinate of the variant overlapped by the read
    • vW: 0/1 - alignment does not pass / passes WASP filtering. Requires --waspOutputMode SAMtag
    • CR CY UR UY: sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing
    • Unsupported/undocumented:
      • rB: alignment block read/genomic coordinates
      • vR: read coordinate of the variant
  • --outSAMattrIHstart int>=0: start value for the IH attribute. 0 may be required by some downstream software, such as Cufflinks or StringTie. (default: 1)
  • --outSAMunmapped string(s): output of unmapped reads in the SAM format.
    • 1st word:
      • None: no output (default)
      • Within: output unmapped reads within the main SAM file (i.e. Aligned.out.sam)
    • 2nd word:
      • KeepPairs: record unmapped mate for each alignment, and, in case of unsorted output, keep it adjacent to its mapped mate. Only affects multi-mapping reads.
  • --outSAMorder string: type of sorting for the SAM output.
    • Paired: one mate after the other for all paired alignments (default)
    • PairedKeepInputOrder: one mate after the other for all paired alignments, the order is kept the same as in the input FASTQ files
  • --outSAMprimaryFlag string: which alignments are considered primary - all others will be marked with 0x100 bit in the FLAG.
    • OneBestScore: only one alignment with the best score is primary (default)
    • AllBestScore: all alignments with the best score are primary
  • --outSAMreadID string: read ID record type
    • Standard: first word (until space) from the FASTx read ID line, removing /1,/2 from the end (default)
    • Number: read number (index) in the FASTx file
  • --outSAMmapqUnique int: 0 to 255: the MAPQ value for unique mappers. (default: 255)
  • --outSAMflagOR int: 0 to 65535: sam FLAG will be bitwise OR'd with this value, i.e. FLAG=FLAG | outSAMflagOR. This is applied after all flags have been set by STAR, and after --outSAMflagAND. Can be used to set specific bits that are not set otherwise. (default: 0)
  • --outSAMflagAND int: 0 to 65535: sam FLAG will be bitwise AND'd with this value, i.e. FLAG=FLAG & outSAMflagAND. This is applied after all flags have been set by STAR, but before --outSAMflagOR. Can be used to unset specific bits. (default: 65535)
  • --outSAMattrRGline string(s): SAM/BAM read group line. The first word contains the read group identifier and must start with "ID:", e.g. --outSAMattrRGline ID:xxx CN:yy "DS:z z z". xxx will be added as RG tag to each output alignment. Any spaces in the tag values have to be double quoted. Comma separated RG lines correspond to different (comma separated) input files in --readFilesIn. Commas have to be surrounded by spaces, e.g. --outSAMattrRGline ID:xxx , ID:zzz "DS:z z" , ID:yyy DS:yyyy. (default: -)
  • --outSAMheaderHD strings: @HD (header) line of the SAM header. (default: -)
  • --outSAMheaderPG strings: extra @PG (software) line of the SAM header (in addition to STAR). (default: -)
  • --outSAMheaderCommentFile string: path to the file with @CO (comment) lines of the SAM header. (default: -)
  • --outSAMfilter string(s): filter the output into main SAM/BAM files. (default: None)
    • KeepOnlyAddedReferences: only keep the reads for which all alignments are to the extra reference sequences added with --genomeFastaFiles at the mapping stage.
    • KeepAllAddedReferences: keep all alignments to the extra reference sequences added with --genomeFastaFiles at the mapping stage.
  • --outSAMmultNmax int: max number of multiple alignments for a read that will be output to the SAM/BAM files. -1: all alignments (up to --outFilterMultimapNmax) will be output. (default: -1)
  • --outSAMtlen int: calculation method for the TLEN field in the SAM/BAM files
    • 1: leftmost base of the (+)strand mate to rightmost base of the (-)mate. (+)sign for the (+)strand mate (default)
    • 2: leftmost base of any mate to rightmost base of any mate. (+)sign for the mate with the leftmost base. This is different from 1 for overlapping mates with protruding ends
  • --outBAMcompression int: -1 to 10 BAM compression level, -1=default compression (6?), 0=no compression, 10=maximum compression. (default: 1)
  • --outBAMsortingThreadN int: >=0: number of threads for BAM sorting. 0 will default to min(6, --runThreadN). (default: 0)
  • --outBAMsortingBinsN int: >0: number of genome bins for coordinate-sorting. (default: 50)
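
The order of operations described for --outSAMflagAND and --outSAMflagOR above (AND first, then OR) can be sketched with shell arithmetic. The function name and the example FLAG value are made up; the masks clear the 0x100 bit and set the 0x400 bit.

```shell
# Hypothetical helper: apply the AND mask first, then the OR mask.
# Arguments: <FLAG> <outSAMflagAND> <outSAMflagOR>
apply_flag_masks() {
    echo $(( ($1 & $2) | $3 ))
}

# FLAG 355 has the 0x100 bit set.
# AND with 65279 (0xFEFF) clears 0x100; OR with 1024 sets the 0x400 bit.
apply_flag_masks 355 65279 1024   # prints 1123
```

With the defaults (AND mask 65535, OR mask 0) the function returns the FLAG unchanged.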
Wiggle
  • --outWigType string(s): type of signal output, e.g. "bedGraph" OR "bedGraph read1_5p". Requires sorted BAM: --outSAMtype BAM SortedByCoordinate.
    • 1st word:
      • None: no signal output (default)
      • bedGraph: bedGraph format
      • wiggle: wiggle format
    • 2nd word:
      • read1_5p: signal from only 5' of the 1st read, useful for CAGE/RAMPAGE/PROcap etc
      • read2: signal from only 2nd read
  • --outWigStrand string: strandedness of wiggle/bedGraph output.
    • Stranded: separate strands, str1 and str2 (default)
    • Unstranded: collapsed strands
  • --outWigReferencesPrefix string: prefix matching reference names to include in the output wiggle file, e.g. "chr", default "-" - include all references. (default: -)
  • --outWigNorm string: type of normalization for the signal
    • RPM: reads per million of mapped reads (default)
    • None: no normalization, "raw" counts
BAM processing
  • --bamRemoveDuplicatesType string: mark duplicates in the BAM file. For now this only works with (i) sorted BAM fed with inputBAMfile, and (ii) paired-end alignments.
    • -: no duplicate removal/marking (default)
    • UniqueIdentical: mark all multimappers, and duplicate unique mappers. The coordinates, FLAG, CIGAR must be identical
    • UniqueIdenticalNotMulti: mark duplicate unique mappers but not multimappers.
  • --bamRemoveDuplicatesMate2basesN int>0: number of bases from the 5' of mate 2 to use in collapsing (e.g. for RAMPAGE) (default: 0)
Output Filtering
  • --outFilterType string: type of filtering
    • Normal: standard filtering using only current alignment (default)
    • BySJout: keep only those reads that contain junctions that passed filtering into SJ.out.tab
  • --outFilterMultimapScoreRange int: the score range below the maximum score for multimapping alignments. (default: 1)
  • --outFilterMultimapNmax int: maximum number of loci the read is allowed to map to. Alignments (all of them) will be output only if the read maps to no more loci than this value. Otherwise no alignments will be output, and the read will be counted as "mapped to too many loci" in the Log.final.out. (default: 10)
  • --outFilterMismatchNmax int: alignment will be output only if it has no more mismatches than this value. (default: 10)
  • --outFilterMismatchNoverLmax real: alignment will be output only if its ratio of mismatches to mapped length is less than or equal to this value. (default: 0.3)
  • --outFilterMismatchNoverReadLmax real: alignment will be output only if its ratio of mismatches to read length is less than or equal to this value. (default: 1.0)
  • --outFilterScoreMin int: alignment will be output only if its score is higher than or equal to this value. (default: 0)
  • --outFilterScoreMinOverLread real: same as outFilterScoreMin, but normalized to read length (sum of mates' lengths for paired-end reads) (default: 0.66)
  • --outFilterMatchNmin int: alignment will be output only if the number of matched bases is higher than or equal to this value. (default: 0)
  • --outFilterMatchNminOverLread real: same as outFilterMatchNmin, but normalized to the read length (sum of mates' lengths for paired-end reads). (default: 0.66)
  • --outFilterIntronMotifs string: filter alignment using their motifs
    • None: no filtering (default)
    • RemoveNoncanonical: filter out alignments that contain non-canonical junctions
    • RemoveNoncanonicalUnannotated: filter out alignments that contain non-canonical unannotated junctions when using annotated splice junctions database. The annotated non-canonical junctions will be kept.
  • --outFilterIntronStrands string: filter alignments
    • RemoveInconsistentStrands: remove alignments that have junctions with inconsistent strands (default)
    • None: no filtering
Splice Junctions
  • --outSJfilterReads string: which reads to consider for collapsed splice junctions output
    • All: all reads, unique- and multi-mappers (default)
    • Unique: uniquely mapping reads only
  • --outSJfilterOverhangMin 4 integers: minimum overhang length for splice junctions on both sides for:
    1. non-canonical motifs,
    2. GT/AG and CT/AC motif,
    3. GC/AG and CT/GC motif,
    4. AT/AC and GT/AT motif.

    -1 means no output for that motif. Does not apply to annotated junctions. (default: 30 12 12 12)

  • --outSJfilterCountUniqueMin 4 integers: minimum uniquely mapping read count per junction for:
    1. non-canonical motifs
    2. GT/AG and CT/AC motif
    3. GC/AG and CT/GC motif
    4. AT/AC and GT/AT motif

    -1 means no output for that motif. Junctions are output if one of the --outSJfilterCountUniqueMin OR --outSJfilterCountTotalMin conditions is satisfied. Does not apply to annotated junctions. (default: 3 1 1 1)

  • --outSJfilterCountTotalMin 4 integers: minimum total (multi-mapping+unique) read count per junction for:
    1. non-canonical motifs
    2. GT/AG and CT/AC motif
    3. GC/AG and CT/GC motif
    4. AT/AC and GT/AT motif

    -1 means no output for that motif. Junctions are output if one of the --outSJfilterCountUniqueMin OR --outSJfilterCountTotalMin conditions is satisfied. Does not apply to annotated junctions. (default: 3 1 1 1)

  • --outSJfilterDistToOtherSJmin 4 integers>=0: minimum allowed distance to other junctions' donor/acceptor. Does not apply to annotated junctions. (default: 10 0 5 10)
  • --outSJfilterIntronMaxVsReadN N integers>=0: maximum gap allowed for junctions supported by 1, 2, 3, ..., N reads, i.e. by default junctions supported by 1 read can have gaps <=50000b, by 2 reads: <=100000b, by 3 reads: <=200000b, and by >=4 reads: any gap <=alignIntronMax. Does not apply to annotated junctions. (default: 50000 100000 200000)
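
The lookup described for --outSJfilterIntronMaxVsReadN can be sketched as a small function: the i-th value is the gap limit for junctions supported by i reads, and anything beyond the list falls back to alignIntronMax. The function name is hypothetical, and it prints the literal word alignIntronMax for the fallback case rather than a number.

```shell
# Hypothetical helper: gap limit for a junction supported by <n_reads> reads (n_reads >= 1).
# Usage: max_gap_for_reads <n_reads> <limit_for_1_read> <limit_for_2_reads> ...
max_gap_for_reads() {
    n=$1
    shift
    i=1
    for limit in "$@"; do
        if [ "$i" -eq "$n" ]; then
            echo "$limit"
            return
        fi
        i=$((i + 1))
    done
    echo "alignIntronMax"   # more supporting reads than list entries
}

max_gap_for_reads 2 50000 100000 200000   # prints 100000
max_gap_for_reads 4 50000 100000 200000   # prints alignIntronMax
```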
Scoring
  • --scoreGap int: splice junction penalty (independent of intron motif). (default: 0)
  • --scoreGapNoncan int: non-canonical junction penalty (in addition to scoreGap). (default: -8)
  • --scoreGapGCAG: GC/AG and CT/GC junction penalty (in addition to scoreGap). (default: -4)
  • --scoreGapATAC: AT/AC and GT/AT junction penalty (in addition to scoreGap). (default: -8)
  • --scoreGenomicLengthLog2scale: extra score logarithmically scaled with genomic length of the alignment: $\text{scoreGenomicLengthLog2scale}\times\log_2(\text{genomicLength})$. (default: -0.25)
  • --scoreDelOpen: deletion open penalty. (default: -2)
  • --scoreDelBase: deletion extension penalty per base (in addition to scoreDelOpen). (default: -2)
  • --scoreInsOpen: insertion open penalty. (default: -2)
  • --scoreInsBase: insertion extension penalty per base (in addition to scoreInsOpen). (default: -2)
  • --scoreStitchSJshift: maximum score reduction while searching for SJ boundaries in the stitching step. (default: 1)
Alignments and Seeding
  • --seedSearchStartLmax int>0: defines the search start point through the read - the read is split into pieces no longer than this value. (default: 50)
  • --seedSearchStartLmaxOverLread real: seedSearchStartLmax normalized to read length (sum of mates' lengths for paired-end reads). (default: 1.0)
  • --seedSearchLmax int>=0: defines the maximum length of the seeds; if =0, max seed length is infinite. (default: 0)
  • --seedMultimapNmax int>0: only pieces that map fewer than this value are utilized in the stitching procedure. (default: 10000)
  • --seedPerReadNmax int>0: max number of seeds per read. (default: 1000)
  • --seedPerWindowNmax int>0: max number of seeds per window. (default: 50)
  • --seedNoneLociPerWindow int>0: max number of one seed loci per window. (default: 10)
  • --seedSplitMin int>0: min length of the seed sequences split by Ns or mate gap. (default: 12)
  • --alignIntronMin: minimum intron size. A genomic gap is considered an intron if its length>=alignIntronMin, otherwise it is considered a deletion. (default: 21)
  • --alignIntronMax: maximum intron size. If 0, max intron size will be determined by $(2^\text{winBinNbits})\times\text{winAnchorDistNbins}$. (default: 0)
  • --alignMatesGapMax: maximum gap between two mates. If 0, max intron gap will be determined by $(2^\text{winBinNbits})\times\text{winAnchorDistNbins}$. (default: 0)
  • --alignSJoverhangMin int>0: minimum overhang (i.e. block size) for spliced alignments. (default: 5)
  • --alignSJstitchMismatchNmax 4*int>=0: maximum number of mismatches for stitching of the splice junctions (-1: no limit). (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. (default: 0 -1 0 0)
  • --alignSJDBoverhangMin int>0: minimum overhang (i.e. block size) for annotated (sjdb) spliced alignments. (default: 3)
  • --alignSplicedMateMapLmin int>0: minimum mapped length for a read mate that is spliced. (default: 0)
  • --alignSplicedMateMapLminOverLmate real>0: alignSplicedMateMapLmin normalized to mate length. (default: 0.66)
  • --alignWindowsPerReadNmax int>0: max number of windows per read. (default: 10000)
  • --alignTranscriptsPerWindowNmax int>0: max number of transcripts per window. (default: 100)
  • --alignTranscriptsPerReadNmax int>0: max number of different alignments per read to consider. (default: 10000)
  • --alignEndsType string: type of read ends alignment
    • Local: standard local alignment with soft-clipping allowed (default)
    • EndToEnd: force end-to-end read alignment, do not soft-clip
    • Extend5pOfRead1: fully extend only the 5p of the read1, all other ends: local alignment
    • Extend5pOfReads12: fully extend only the 5p of both read1 and read2, all other ends: local alignment
  • --alignEndsProtrude int, string: allow protrusion of alignment ends, i.e. start (end) of the +strand mate downstream of the start (end) of the -strand mate (default: 0 ConcordantPair)
    • 1st word: int: maximum number of protrusion bases allowed
    • 2nd word: string:
      • ConcordantPair: report alignments with non-zero protrusion as concordant pairs
      • DiscordantPair: report alignments with non-zero protrusion as discordant pairs
  • --alignSoftClipAtReferenceEnds string: allow the soft-clipping of the alignments past the end of the chromosomes
    • Yes: allow (default)
    • No: prohibit, useful for compatibility with Cufflinks
  • --alignInsertionFlush string: how to flush ambiguous insertion positions
    • None: insertions are not flushed (default)
    • Right: insertions are flushed to the right
Paired-End reads
  • --peOverlapNbasesMin int>=0: minimum number of overlap bases to trigger mates merging and realignment. (default: 0)
  • --peOverlapMMp real, >=0 & <1: maximum proportion of mismatched bases in the overlap area. (default: 0.01)
Windows, Anchors, Binning
  • --winAnchorMultimapNmax int>0: max number of loci anchors are allowed to map to. (default: 50)
  • --winBinNbits int>0: $=\log_2(\text{winBin})$, where $\text{winBin}$ is the size of the bin for the windows/clustering, each window will occupy an integer number of bins. (default: 16)
  • --winAnchorDistNbins int>0: max number of bins between two anchors that allows aggregation of anchors into one window. (default: 9)
  • --winFlankNbins int>0: $\log_2(\text{winFlank})$, where $\text{winFlank}$ is the size of the left and right flanking regions for each window. (default: 4)
  • --winReadCoverageRelativeMin real>=0: minimum relative coverage of the read sequence by the seeds in a window, for STARlong algorithm only. (default: 0.5)
  • --winReadCoverageBasesMin int>0: minimum number of bases covered by the seeds in a window, for STARlong algorithm only. (default: 0)
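
When --alignIntronMax or --alignMatesGapMax is left at 0, the effective limit is given by the window formula $(2^\text{winBinNbits})\times\text{winAnchorDistNbins}$. A quick calculation with the defaults listed above (16 and 9) shows the resulting size; the variable names are chosen for this sketch only.

```shell
# Evaluate (2^winBinNbits) * winAnchorDistNbins with the documented defaults.
win_bin_nbits=16
win_anchor_dist_nbins=9
max_gap=$(( (1 << win_bin_nbits) * win_anchor_dist_nbins ))
echo "$max_gap"   # prints 589824
```

So with default window settings the implied maximum is 589824 bases, which is smaller than the 1000000 used for --alignIntronMax in the ENCODE examples.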
Chimeric Alignments
  • --chimOutType string(s): type of chimeric output
    • Junctions: Chimeric.out.junction (default)
    • SeparateSAMold: output old SAM into separate Chimeric.out.sam file
    • WithinBAM: output into main aligned BAM files (Aligned.*.bam)
    • WithinBAM HardClip: hard-clipping in the CIGAR for supplemental chimeric alignments (default if no 2nd word is present)
    • WithinBAM SoftClip: soft-clipping in the CIGAR for supplemental chimeric alignments
  • --chimSegmentMin int>=0: minimum length of chimeric segment; if 0, no chimeric output. (default: 0)
  • --chimScoreMin int>=0: minimum total (summed) score of the chimeric segments. (default: 0)
  • --chimScoreDropMax int>=0: max drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length. (default: 20)
  • --chimScoreSeparation int>=0: minimum difference (separation) between the best chimeric score and the next one. (default: 10)
  • --chimScoreJunctionNonGTAG int: penalty for a non-GT/AG chimeric junction. (default: -1)
  • --chimJunctionOverhangMin int>=0: minimum overhang for a chimeric junction. (default: 20)
  • --chimSegmentReadGapMax int>=0: maximum gap in the read sequence between chimeric segments. (default: 0)
  • --chimFilter string(s): different filters for chimeric alignments (default: banGenomicN)
    • None: no filtering
    • banGenomicN: Ns are not allowed in the genome sequence around the chimeric junction
  • --chimMainSegmentMultNmax int>=1: maximum number of multi-alignments for the main chimeric segment. =1 will prohibit multimapping main segments. (default: 10)
  • --chimMultimapNmax int>=0: maximum number of chimeric multi-alignments. 0: use the old scheme for chimeric detection, which only considered unique alignments. (default: 0)
  • --chimMultimapScoreRange int>=0: the score range for multi-mapping chimeras below the best chimeric score. Only works with --chimMultimapNmax > 1. (default: 1)
  • --chimNonchimScoreDropMin int>=0: to trigger chimeric detection, the drop in the best non-chimeric alignment score with respect to the read length has to be smaller than this value. (default: 20)
  • --chimOutJunctionFormat int: formatting type for the Chimeric.out.junction file (default: 0)
    • 0: no comment lines/headers
    • 1: comment lines at the end of the file: command line and Nreads: total, unique, multi
Quantification of Annotations
  • --quantMode string(s): types of quantification requested (default: -)
    • -: none
    • TranscriptomeSAM: output SAM/BAM alignments to transcriptome into a separate file. Use this option if you want to quantify expression levels of different isoforms/transcripts in conjunction with tools like RSEM.
    • GeneCounts: count reads per gene. Use this option if you want a table of read counts mapped to each gene, the table will be saved to your output folder with file names like ReadsPerGene.out.tab.
  • --quantTranscriptomeBAMcompression int: -2 to 10 transcriptome BAM compression level
    • -2: no BAM output
    • -1: default compression (6?)
    • 0: no compression
    • 10: maximum compression
  • --quantTranscriptomeBan string: prohibit various alignment types
    • IndelSoftclipSingleend: prohibit indels, soft clipping and single-end alignments - compatible with RSEM (default)
    • Singleend: prohibit single-end alignments
Miscs
  • --versionGenome string: earliest genome index version compatible with this STAR release. Please do not change this value!
  • --parametersFiles string: name of a user-defined parameters file, "-": none. Can only be defined on the command line. (default: -)
  • --sysShell string: path to the shell binary, preferably bash, e.g. /bin/bash. If not set, the default shell is executed, typically /bin/sh. This was reported to fail on some Ubuntu systems, in which case you need to specify the path to bash.

Examples

Before proceeding to the examples, please make sure you have a STAR index already. If you don't have it, refer to STAR genomeGenerate to generate an index on your machine, or visit here to download some precompiled indexes. In this section, we assume the index is stored in a folder called /data/star_index, remember to replace it with your own path.

ENCODE RNA-seq processing pipelines

The ENCODE RNA-seq processing pipelines align sequencing reads to both the genome and the transcriptome, so you will get two bam files: Aligned.sortedByCoord.out.bam (mapped to the genome, for general use) and Aligned.toTranscriptome.out.bam (mapped to the transcriptome, triggered by --quantMode TranscriptomeSAM). You can feed Aligned.toTranscriptome.out.bam to transcript quantification tools like RSEM to quantify the expression level of RNA transcripts/isoforms.

For paired-end libraries: assuming you have gzip-compressed read files read1.fq.gz and read2.fq.gz, you can align them with a command like the following:

# Source: https://github.com/ENCODE-DCC/rna-seq-pipeline/blob/53d8c96a112bfa1079f21e680a7bfc3df3a6f031/src/align.py
$ STAR --genomeDir /data/star_index --readFilesIn read1.fq.gz read2.fq.gz \
    --readFilesCommand zcat --runThreadN 4 \
    --genomeLoad NoSharedMemory --outFilterMultimapNmax 20 \
    --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 \
    --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 \
    --alignIntronMin 20 --alignIntronMax 1000000 \
    --alignMatesGapMax 1000000 --outSAMheaderHD @HD VN:1.4 SO:coordinate \
    --outSAMunmapped Within --outFilterType BySJout \
    --outSAMattributes NH HI AS NM MD --outSAMtype BAM SortedByCoordinate \
    --quantMode TranscriptomeSAM --sjdbScore 1 \
    --limitBAMsortRAM 8000000000

For single-end libraries: assuming you have a gzip-compressed read file reads.fq.gz, you can align it with a command like the following:

$ STAR --genomeDir /data/star_index --readFilesIn reads.fq.gz \
    --readFilesCommand zcat --runThreadN 4 \
    --genomeLoad NoSharedMemory --outFilterMultimapNmax 20 \
    --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 \
    --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 \
    --alignIntronMin 20 --alignIntronMax 1000000 \
    --alignMatesGapMax 1000000 --outSAMheaderHD @HD VN:1.4 SO:coordinate \
    --outSAMunmapped Within --outFilterType BySJout \
    --outSAMattributes NH HI AS NM MD --outSAMstrandField intronMotif \
    --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM \
    --sjdbScore 1 --limitBAMsortRAM 8000000000
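
To make the hand-off to RSEM more concrete, here is a minimal sketch of a wrapper that passes Aligned.toTranscriptome.out.bam to rsem-calculate-expression. It is not part of the ENCODE pipeline. The function name quantify_with_rsem, the DRY_RUN switch, the reference prefix /data/rsem_index/rsem and the sample name SampleA are all assumptions for illustration, and the reference is assumed to have been built beforehand with rsem-prepare-reference. Older RSEM releases may expect --bam in place of --alignments.

```shell
# Sketch: quantify STAR's transcriptome BAM with RSEM.
# Usage: quantify_with_rsem BAM RSEM_REF_PREFIX SAMPLE_NAME [paired|single]
# Set DRY_RUN=1 to print the command instead of running it.
quantify_with_rsem() {
    bam=$1 ref=$2 name=$3 layout=${4:-paired}
    # Start the option list: alignment input, 4 threads.
    set -- --alignments -p 4
    if [ "$layout" = paired ]; then
        set -- "$@" --paired-end
    fi
    ${DRY_RUN:+echo} rsem-calculate-expression "$@" "$bam" "$ref" "$name"
}

# Print the command for a paired-end library without running RSEM.
DRY_RUN=1 quantify_with_rsem Aligned.toTranscriptome.out.bam /data/rsem_index/rsem SampleA paired
```

Printing the command first makes it easy to review the options before committing to a long run; unset DRY_RUN to execute it.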

ENCODE RAMPAGE processing pipelines

The RAMPAGE libraries from the ENCODE project have a 6 nt pool barcode at the end of the RNAs and are generated with random primers (N=15). When aligning the reads, the pipeline therefore uses --clip5pNbases 6 15, which clips 6 bases from the 5' end of read 1 and 15 bases from the 5' end of read 2, to remove these two types of artifacts.

# Source: https://github.com/ENCODE-DCC/long-rna-seq-pipeline/blob/master/dnanexus/rampage/rampage-align-pe/resources/usr/bin/rampage_align_star.sh
$ STAR --genomeDir /data/star_index --readFilesIn read1.fq.gz read2.fq.gz \
    --readFilesCommand zcat --runThreadN 4 --genomeLoad NoSharedMemory \
    --outFilterMultimapNmax 500 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 \
    --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 \
    --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 \
    --outSAMheaderHD @HD VN:1.4 SO:coordinate \
    --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD \
    --outFilterScoreMinOverLread 0.85 --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
    --clip5pNbases 6 15 --seedSearchStartLmax 30 --outSAMtype BAM SortedByCoordinate \
    --limitBAMsortRAM 6000000000

Nascent RNA sequencing alignment

GRO-cap captures nascent RNAs, which are usually not spliced. When aligning reads from these libraries, you can set --alignIntronMax 1000 to prevent STAR from reporting alignments with long splice gaps.

$ STAR --readFilesCommand zcat \
     --runThreadN 16 \
     --genomeDir /data/star_index \
     --outFileNamePrefix SampleA_ \
     --readFilesIn reads.fq.gz  \
     --outSAMmultNmax 1 \
     --outFilterMultimapNmax 50 \
     --alignIntronMax 1000  \
     --outSAMtype BAM SortedByCoordinate
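
One way to sanity-check the effect of --alignIntronMax is to look at the longest splice gap (the N operation in the CIGAR string) among the reported alignments. The sketch below does this on SAM text read from standard input; the function name max_intron_length is made up for this example, and it is only a rough check, not something STAR provides.

```shell
# Print the longest N (skipped region) length found in the CIGAR strings
# of SAM records read from stdin; prints 0 if there is none.
max_intron_length() {
    awk -F'\t' '
        /^@/ { next }                               # skip SAM header lines
        {
            cigar = $6
            # Walk through the CIGAR one <length><operation> pair at a time.
            while (match(cigar, /[0-9]+[MIDNSHP=X]/)) {
                len = substr(cigar, RSTART, RLENGTH - 1) + 0
                op  = substr(cigar, RSTART + RLENGTH - 1, 1)
                if (op == "N" && len > max) max = len
                cigar = substr(cigar, RSTART + RLENGTH)
            }
        }
        END { print max + 0 }
    '
}
```

It could be used as samtools view SampleA_Aligned.sortedByCoord.out.bam | max_intron_length; with --alignIntronMax 1000 the reported value would be expected not to exceed 1000.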

Correcting Mapping Bias with WASP

When we align sequencing reads to a reference genome, reads that carry a non-reference allele may map less well than reads that match the reference. Although these discrepancies affect only a few locations, they can cause significant distortions in genome-wide tests that detect whether certain genes or alleles are more active than others. STAR has a highly efficient implementation of WASP, which corrects this allele-specific mapping bias when the sample's genotypes are available (fed to STAR through the --varVCFfile option). To enable WASP, specify --waspOutputMode SAMtag together with the --varVCFfile option:

$ STAR --readFilesCommand zcat \
    --runThreadN 16 \
    --genomeDir /data/star_index \
    --outFileNamePrefix SampleA_ \
    --readFilesIn reads.fq.gz \
    --outSAMtype BAM SortedByCoordinate \
    --waspOutputMode SAMtag \
    --varVCFfile sample.vcf

Now we can use samtools to select reads that are uniquely mapped (-q 255) and free of allele-specific mapping bias, i.e. reads that either do not overlap any SNP (![vW]) or pass the WASP filter ([vW]==1):

$ samtools view -h -b -e '![vW] || [vW]==1' -q 255 -o filtered.bam SampleA_Aligned.sortedByCoord.out.bam
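
To spell out what this filter expression selects, the same logic can be sketched in awk on SAM text. This is an illustration only and not a replacement for the samtools command; the function name wasp_filter_sam is made up for this example.

```shell
# Keep SAM records that are uniquely mapped (MAPQ 255) and either carry no
# vW tag or carry vW:i:1 (passed WASP filtering). Header lines pass through.
wasp_filter_sam() {
    awk -F'\t' '
        /^@/ { print; next }                # SAM header
        $5 != 255 { next }                  # not uniquely mapped by STAR
        {
            vw = ""
            for (i = 12; i <= NF; i++)      # optional tags start at column 12
                if ($i ~ /^vW:i:/) vw = substr($i, 6)
            if (vw == "" || vw == "1") print
        }
    '
}
```

It reads SAM from standard input, e.g. samtools view -h SampleA_Aligned.sortedByCoord.out.bam | wasp_filter_sam. Reads with any other vW value failed one of the WASP checks and are dropped.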

Protocols using this tool

PROcap preprocessing (with two replicates)
ENCODE RNA-seq (paired-end)

File formats this tool works with
FASTQ
