Reference Code backup Executable files
“guess” how RNA-seq sequencing were configured, particulary how reads were stranded for strand-specific RNA-seq data
infer_experiment.py -r gene_annotation.bed -i input.bam [options]
This program is used to “guess” how RNA-seq sequencing were configured, particulary how reads were stranded for strand-specific RNA-seq data, through comparing the “strandness of reads” with the “standness of transcripts”.
The “strandness of reads” is determiend from alignment, and the “standness of transcripts” is determined from annotation.
For non strand-specific RNA-seq data, “strandness of reads” and “standness of transcripts” are independent.
For strand-specific RNA-seq data, “strandness of reads” is largely determined by “standness of transcripts”. See below 3 examples for details.
You don’t need to know the RNA sequencing protocol before mapping your reads to the reference genome. Mapping your RNA-seq reads as if they were non-strand specific, this script can “guess” how RNA-seq reads were stranded.
For pair-end RNA-seq, there are two different ways to strand reads (such as Illumina ScriptSeq protocol):
1+-,1-+,2++,2--
read1 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
read1 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
read2 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
read2 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
For single-end RNA-seq, there are also two different ways to strand reads:
++,--
read mapped to ‘+’ strand indicates parental gene on ‘+’ strand
read mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
+-,-+
read mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
read mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
Example 1: Pair-end non strand specific:
infer_experiment.py -r hg19.refseq.bed12 -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam #Output:: This is PairEnd Data Fraction of reads failed to determine: 0.0172 Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903 Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925
Interpretation: 1.72% of total reads were mapped to genome regions that we cannot determine the “standness of transcripts” (such as regions that having both strands transcribed). For the remaining 98.28% (1 - 0.0172 = 0.9828) of reads, half can be explained by “1++,1--,2+-,2-+”, while the other half can be explained by “1+-,1-+,2++,2--”. We conclude that this is NOT a strand specific dataset because “strandness of reads” was independent of “standness of transcripts”
Example 2: Pair-end strand specific:
infer_experiment.py -r hg19.refseq.bed12 -i Pairend_StrandSpecific_51mer_Human_hg19.bam #Output:: This is PairEnd Data Fraction of reads failed to determine: 0.0072 Fraction of reads explained by "1++,1--,2+-,2-+": 0.9441 Fraction of reads explained by "1+-,1-+,2++,2--": 0.0487
Interpretation: 0.72% of total reads were mapped to genome regions that we cannot determine the “standness of transcripts” (such as regions that having both strands transcribed). For the remaining 99.28% (1 - 0.0072 = 0.9928) of reads, the vast majority was explained by “1++,1--,2+-,2-+”, suggesting a strand-specific dataset.
Example 3: Single-end strand specific:
infer_experiment.py -r hg19.refseq.bed12 -i SingleEnd_StrandSpecific_36mer_Human_hg19.bam #Output:: This is SingleEnd Data Fraction of reads failed to determine: 0.0170 Fraction of reads explained by "++,--": 0.9669 Fraction of reads explained by "+-,-+": 0.0161
Interpretation: This is single-end, strand specific RNA-seq data. Strandness of reads are concordant with strandness of reference gene.