gtfToGenePred [options] gtf genePred
This tool is part of UCSC Genome Browser's utilities.
See also: gff3ToGenePred
The Picard CollectRnaSeqMetrics tool produces metrics describing the distribution of the bases within the transcripts. It requires a REF_FLAT file: a tab-delimited file containing information about the location of RNA transcripts, exon start and stop sites, etc. In the following example, we will build the REF_FLAT file for mouse using the gene annotation file from Ensembl:
# first download the annotation file
$ wget ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz
# decompress it
$ gunzip Mus_musculus.GRCm38.102.gtf.gz
# build the ref_flat file with gtfToGenePred and awk
$ gtfToGenePred -genePredExt -geneNameAsName2 -ignoreGroupsWithoutExons Mus_musculus.GRCm38.102.gtf /dev/stdout | \
    awk 'BEGIN { OFS="\t"} {print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' > Mus_musculus.GRCm38.102.gtf.refflat
# check the output
$ head Mus_musculus.GRCm38.102.gtf.refflat
4933401J01Rik	ENSMUST00000193812	1	+	3073252	3074322	3074322	3074322	1	3073252,	3074322,
Gm26206	ENSMUST00000082908	1	+	3102015	3102125	3102125	3102125	1	3102015,	3102125,
Xkr4	ENSMUST00000162897	1	-	3205900	3216344	3216344	3216344	2	3205900,3213608,	3207317,3216344,
Xkr4	ENSMUST00000159265	1	-	3206522	3215632	3215632	3215632	2	3206522,3213438,	3207317,3215632,
Xkr4	ENSMUST00000070533	1	-	3214481	3671498	3216021	3671348	3	3214481,3421701,3670551,	3216968,3421901,3671498,
Gm18956	ENSMUST00000192857	1	+	3252756	3253236	3253236	3253236	1	3252756,	3253236,
Gm37180	ENSMUST00000195335	1	-	3365730	3368549	3368549	3368549	1	3365730,	3368549,
Gm37363	ENSMUST00000192336	1	-	3375555	3377788	3377788	3377788	1	3375555,	3377788,
Gm37686	ENSMUST00000194099	1	-	3464976	3467285	3467285	3467285	1	3464976,	3467285,
Gm1992	ENSMUST00000161581	1	+	3466586	3513553	3513553	3513553	2	3466586,3513404,	3466687,3513553,
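The column reordering done by awk in the example above can be tried on a single record without downloading anything. The sketch below feeds one hand-built 15-column genePredExt line through the same `print` statement. Columns 1-10 and 12 are copied from the first output row above; the values in columns 11 and 13-15 (`0`, `none`, `none`, `-1,`) are assumed placeholders and are not taken from the real file. The sketch also sets the input separator to a tab explicitly, which the original command leaves at awk's default.

```shell
# Build one genePredExt record (15 tab-separated columns) and reorder it
# into the 11-column refFlat layout: name2 first, then columns 1-10.
refflat_line=$(
    printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
        ENSMUST00000193812 1 + 3073252 3074322 3074322 3074322 1 \
        3073252, 3074322, 0 4933401J01Rik none none -1, |
    awk 'BEGIN { FS = OFS = "\t" } { print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10 }'
)

# Show the reordered record
printf '%s\n' "$refflat_line"

# Show the number of tab-separated columns in the reordered record
printf '%s\n' "$refflat_line" | awk -F'\t' '{ print NF }'
```

The same column count check can be pointed at the real `.refflat` file to confirm that every row has 11 fields before handing it to Picard.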
In the following example, we'll build an ANNOVAR database for Arabidopsis (ref):
# go to http://plants.ensembl.org/info/website/ftp/index.html to download the GTF file and the genome FASTA file for this plant into a folder called atdb.
$ mkdir atdb
$ cd atdb
$ wget ftp://ftp.ensemblgenomes.org/pub/release-27/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.27.dna.genome.fa.gz
$ wget ftp://ftp.ensemblgenomes.org/pub/release-27/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.27.gtf.gz
# decompress both files
$ gunzip Arabidopsis_thaliana.TAIR10.27.dna.genome.fa.gz
$ gunzip Arabidopsis_thaliana.TAIR10.27.gtf.gz
# use the gtfToGenePred tool to convert the GTF file to GenePred file
$ gtfToGenePred -genePredExt Arabidopsis_thaliana.TAIR10.27.gtf AT_refGene.txt
# generate a transcript FASTA file with the script provided by ANNOVAR
$ perl retrieve_seq_from_fasta.pl --format refGene --seqfile Arabidopsis_thaliana.TAIR10.27.dna.genome.fa AT_refGene.txt --out AT_refGeneMrna.fa
After this step, the annotation database files needed for gene-based annotation are ready. Now you can annotate a given VCF file. Please note that the --buildver argument should be set to AT.
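The build version matters because the two files created above carry it as a filename prefix (`AT_refGene.txt`, `AT_refGeneMrna.fa`). Before annotating, it can be worth confirming that both files exist and are non-empty. The following is a minimal sketch; `check_annovar_db` is a hypothetical helper written for this page, not part of ANNOVAR.

```shell
# Hypothetical helper (not an ANNOVAR command): check that a database
# directory holds non-empty <buildver>_refGene.txt and <buildver>_refGeneMrna.fa.
check_annovar_db() {
    db_dir=$1
    buildver=$2
    status=0
    for f in "${buildver}_refGene.txt" "${buildver}_refGeneMrna.fa"; do
        if [ -s "$db_dir/$f" ]; then
            echo "found:   $db_dir/$f"
        else
            echo "missing: $db_dir/$f"
            status=1
        fi
    done
    return "$status"
}

# Run from inside the atdb folder used above
check_annovar_db . AT || echo "database for build AT is incomplete"
```

If either file is reported missing, rerun the corresponding gtfToGenePred or retrieve_seq_from_fasta.pl step before annotating.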