Category

Sam/Bam Manipulation


Usage

infer_experiment.py -r gene_annotation.bed -i input.bam [options]


Manual

Required arguments

  • -i INPUT_FILE, --input-file=INPUT_FILE: Input alignment file in SAM or BAM format
  • -r REFGENE_BED, --refgene=REFGENE_BED: Reference gene model in bed fomat. Precompiled gene models for human, mouse, fly, and zebrafish are available on the author's website.

Options

  • -s SAMPLE_SIZE, --sample-size=SAMPLE_SIZE: Number of reads sampled from SAM/BAM file. default=200000
  • -q MAP_QUAL, --mapq=MAP_QUAL: Minimum mapping quality (phred scaled) for an alignment to be considered as “uniquely mapped”. default=30
  • --version: show program’s version number and exit
  • -h, --help: show this help message and exit

How it works

  1. This program is used to “guess” how RNA-seq sequencing were configured, particulary how reads were stranded for strand-specific RNA-seq data, through comparing the “strandness of reads” with the “standness of transcripts”.

  2. The “strandness of reads” is determiend from alignment, and the “standness of transcripts” is determined from annotation.

  3. For non strand-specific RNA-seq data, “strandness of reads” and “standness of transcripts” are independent.

  4. For strand-specific RNA-seq data, “strandness of reads” is largely determined by “standness of transcripts”. See below 3 examples for details.

  5. You don’t need to know the RNA sequencing protocol before mapping your reads to the reference genome. Mapping your RNA-seq reads as if they were non-strand specific, this script can “guess” how RNA-seq reads were stranded.

For pair-end RNA-seq, there are two different ways to strand reads (such as Illumina ScriptSeq protocol):

  1. 1++,1--,2+-,2-+
    • read1 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
    • read1 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
    • read2 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
    • read2 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
  2. 1+-,1-+,2++,2--

    • read1 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand

    • read1 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand

    • read2 mapped to ‘+’ strand indicates parental gene on ‘+’ strand

    • read2 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand

For single-end RNA-seq, there are also two different ways to strand reads:

  1. ++,--

    1. read mapped to ‘+’ strand indicates parental gene on ‘+’ strand

    2. read mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand

  2. +-,-+

    1. read mapped to ‘+’ strand indicates parental gene on ‘-‘ strand

    2. read mapped to ‘-‘ strand indicates parental gene on ‘+’ strand

Examples

Example 1: Pair-end non strand specific:

infer_experiment.py -r hg19.refseq.bed12 -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam

#Output::

This is PairEnd Data
Fraction of reads failed to determine: 0.0172
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925

Interpretation: 1.72% of total reads were mapped to genome regions that we cannot determine the “standness of transcripts” (such as regions that having both strands transcribed). For the remaining 98.28% (1 - 0.0172 = 0.9828) of reads, half can be explained by “1++,1--,2+-,2-+”, while the other half can be explained by “1+-,1-+,2++,2--”. We conclude that this is NOT a strand specific dataset because “strandness of reads” was independent of “standness of transcripts”

Example 2: Pair-end strand specific:

infer_experiment.py -r hg19.refseq.bed12 -i Pairend_StrandSpecific_51mer_Human_hg19.bam

#Output::

This is PairEnd Data
Fraction of reads failed to determine: 0.0072
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9441
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0487

Interpretation: 0.72% of total reads were mapped to genome regions that we cannot determine the “standness of transcripts” (such as regions that having both strands transcribed). For the remaining 99.28% (1 - 0.0072 = 0.9928) of reads, the vast majority was explained by “1++,1--,2+-,2-+”, suggesting a strand-specific dataset.

Example 3: Single-end strand specific:

infer_experiment.py -r hg19.refseq.bed12 -i SingleEnd_StrandSpecific_36mer_Human_hg19.bam

#Output::

This is SingleEnd Data
Fraction of reads failed to determine: 0.0170
Fraction of reads explained by "++,--": 0.9669
Fraction of reads explained by "+-,-+": 0.0161

Interpretation: This is single-end, strand specific RNA-seq data. Strandness of reads are concordant with strandness of reference gene.


Share your experience or ask a question