Prepare transcript references for RSEM and optionally build BOWTIE/BOWTIE2/STAR indices.
rsem-prepare-reference [options] reference_fasta_file(s) reference_name
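reference_fasta_file(s)
    Either a comma-separated list of Multi-FASTA files or a directory name. If a directory name is given, RSEM will read all files with suffix ".fa" or ".fasta" in this directory. The files should contain either the sequences of transcripts or an entire genome, depending on whether the --gtf option is used.

reference_name
    The name of the reference. RSEM will generate several reference-related files that are prefixed by this name. This name can contain path information (e.g. '/ref/mm9').

--gtf <file>
    If this option is on, RSEM assumes that reference_fasta_file(s) contains the sequence of a genome, and will extract transcript reference sequences using the gene annotations specified in <file>, which should be in GTF format.
    If this option and --gff3 are both off, RSEM will assume reference_fasta_file(s) contains the reference transcripts. In this case, RSEM assumes that the name of each sequence in the Multi-FASTA files is its transcript_id. (Default: off)

--gff3 <file>
    The annotation file is in GFF3 format instead of GTF format. RSEM will first convert it to GTF format with the file name reference_name.gtf. Please make sure that reference_name.gtf does not already exist. (Default: off)

--transcript-to-gene-map <file>
    Use the information in <file> to map transcript (isoform) ids to gene ids. Each line of <file> should be of the form:
        gene_id transcript_id
    with the two fields separated by a tab character.
    If this option is off, the mapping depends on whether --gtf is specified. If --gtf is specified, RSEM uses the "gene_id" and "transcript_id" attributes in the GTF file. Otherwise, RSEM assumes that each sequence in the reference sequence files is a separate gene. (Default: off)

--allele-to-gene-map <file>
    Use the information in <file> to provide gene_id and transcript_id information for each allele-specific transcript. Each line of <file> should be of the form:
        gene_id transcript_id allele_id
    with the fields separated by a tab character.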
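Because both map files must be strictly tab-separated, it can help to sanity-check them before building the reference. The following is a minimal sketch, not part of RSEM: `check_map` is a hypothetical helper, and the identifiers in the sample file are made up for illustration.

```shell
# check_map FILE N: succeed only if every line of FILE has exactly N
# non-empty, tab-separated fields (hypothetical helper, not an RSEM tool).
check_map() {
    awk -F '\t' -v n="$2" '
        NF != n { bad = 1 }
        { for (i = 1; i <= NF; i++) if ($i == "") bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Made-up identifiers, written as gene_id <TAB> transcript_id.
map_file=$(mktemp)
printf 'geneA\ttx1\ngeneA\ttx2\ngeneB\ttx3\n' > "$map_file"

if check_map "$map_file" 2; then
    echo "map file looks well-formed"
else
    echo "map file has malformed lines" >&2
fi
```

Use `2` as the field count for a transcript-to-gene map and `3` for an allele-to-gene map.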
Files used:
.
├── ENCFF159KBI.gtf
└── fas
    ├── ENCFF001RTP.fasta
    ├── ENCFF335FFV.fasta
    └── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
Running the following command generates the index files (with reference_name rsem) in the current folder:
rsem-prepare-reference --gtf ENCFF159KBI.gtf fas ./rsem
If the command succeeds, the index files are saved to the current folder:
.
├── ENCFF159KBI.gtf
├── fas
│   ├── ENCFF001RTP.fasta
│   ├── ENCFF335FFV.fasta
│   └── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
├── rsem.chrlist          # if --gtf is on
├── rsem.grp
├── rsem.idx.fa
├── rsem.n2g.idx.fa
├── rsem.seq
├── rsem.ti
└── rsem.transcripts.fa   # the extracted reference transcripts in Multi-FASTA format
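One way to confirm that a run produced every file in the listing above is a small check over the expected suffixes. This is only a sketch: `missing_reference_files` is a hypothetical helper, and the suffix list is taken from the listing above rather than from any guarantee about other RSEM versions or options.

```shell
# Print each expected reference file that is missing for the given
# reference_name prefix; print nothing if all are present.
# Note: the .chrlist file is expected only when --gtf was used.
missing_reference_files() {
    prefix=$1
    for ext in chrlist grp idx.fa n2g.idx.fa seq ti transcripts.fa; do
        [ -e "$prefix.$ext" ] || echo "$prefix.$ext"
    done
}

# Check the reference built with reference_name ./rsem.
missing=$(missing_reference_files ./rsem)
if [ -z "$missing" ]; then
    echo "all expected reference files are present"
else
    echo "missing files:"
    echo "$missing"
fi
```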
Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version of the mouse genome. We have downloaded the UCSC Genes transcript annotations in GTF format (as mm9.gtf) using the Table Browser and the knownIsoforms.txt file for mm9 from the UCSC Downloads. We also have all chromosome files for mm9 in the directory '/data/mm9'. We want to put the generated reference files under '/ref' with name 'mouse_0'. We do not add any poly(A) tails. Please note that GTF files generated from UCSC's Table Browser do not contain isoform-gene relationship information. For the UCSC Genes annotation, this information can be obtained from the knownIsoforms.txt file. Suppose we want to build Bowtie indices and Bowtie executables are found in '/sw/bowtie'.
There are two ways to write the command:
rsem-prepare-reference --gtf mm9.gtf \
                       --transcript-to-gene-map knownIsoforms.txt \
                       --bowtie \
                       --bowtie-path /sw/bowtie \
                       /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
                       /ref/mouse_0
Or
rsem-prepare-reference --gtf mm9.gtf \
                       --transcript-to-gene-map knownIsoforms.txt \
                       --bowtie \
                       --bowtie-path /sw/bowtie \
                       /data/mm9 \
                       /ref/mouse_0
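If the comma-separated form is preferred (for example, to review exactly which files will be read), the list does not have to be typed by hand. The sketch below assumes a hypothetical helper, `fasta_list`, that joins every ".fa" and ".fasta" file in a directory with commas. It only prints the resulting command rather than running it.

```shell
# fasta_list DIR: print the .fa and .fasta files in DIR as one
# comma-separated string (hypothetical helper, not an RSEM tool).
fasta_list() {
    list=""
    for f in "$1"/*.fa "$1"/*.fasta; do
        [ -e "$f" ] || continue          # skip glob patterns that matched nothing
        list="${list:+$list,}$f"         # add a comma only between entries
    done
    printf '%s\n' "$list"
}

# Show the command that would be run for the mm9 example.
echo rsem-prepare-reference --gtf mm9.gtf \
     --transcript-to-gene-map knownIsoforms.txt \
     --bowtie --bowtie-path /sw/bowtie \
     "$(fasta_list /data/mm9)" /ref/mouse_0
```

Files are listed in the shell's glob order, with the ".fa" matches first and the ".fasta" matches after them.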