gtfToGenePred manual with usage examples

Usage

gtfToGenePred [options] gtf genePred

Manual

This tool is part of UCSC Genome Browser's utilities.

Required arguments

gtf: The input GTF file that you want to convert to genePred format
genePred: Destination for the output genePred file

Options

-genePredExt: create an extended genePred, including frame information and gene name
-allErrors: skip groups with errors rather than aborting. Useful for getting information about as many errors as possible.
-ignoreGroupsWithoutExons: skip groups that contain no exons rather than generating an error.
-infoOut=file: write a file with information on each transcript
-sourcePrefix=pre: only process entries where the source name has the specified prefix. May be repeated.
-impliedStopAfterCds: assume an implied stop codon immediately after the CDS
-simple: check only column validity, not hierarchy; the resulting genePred may be damaged
-geneNameAsName2: if specified, use gene_name for the name2 field instead of gene_id.
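
As a minimal sketch of how the arguments and options above fit together, the following builds a toy GTF file and converts it. The file names (toy.gtf, toy.genePred, toy.info.tsv) and the gene and transcript identifiers are made up for illustration, and the conversion step is skipped when gtfToGenePred is not on the PATH.

```shell
# Build a toy GTF with one two-exon transcript (all identifiers are invented).
attrs='gene_id "G1"; transcript_id "T1"; gene_name "Toy1";'
{
    printf 'chr1\ttoy\texon\t101\t200\t.\t+\t.\t%s\n' "$attrs"
    printf 'chr1\ttoy\texon\t301\t400\t.\t+\t.\t%s\n' "$attrs"
} > toy.gtf

# Convert it only if gtfToGenePred is installed.
if command -v gtfToGenePred >/dev/null 2>&1; then
    # -genePredExt adds frame and gene name columns;
    # -infoOut writes a separate file with per-transcript information
    gtfToGenePred -genePredExt -infoOut=toy.info.tsv toy.gtf toy.genePred
    cat toy.genePred
else
    echo "gtfToGenePred not found; skipping conversion" >&2
fi
```

Options always come before the two positional arguments, as in the usage line.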

Examples

Create the REF_FLAT file required by CollectRnaSeqMetrics (Picard)

The Picard CollectRnaSeqMetrics tool produces metrics describing the distribution of bases within transcripts. It requires a REF_FLAT file: a tab-delimited file describing the location of each RNA transcript, its exon start and end positions, and so on. The following example builds a REF_FLAT file for mouse from the Ensembl gene annotation file:

# first download the annotation file
$ wget ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz
# decompress it
$ gunzip Mus_musculus.GRCm38.102.gtf.gz
# build the REF_FLAT file: convert to genePredExt, then use awk to put the gene name (column 12, name2) first, followed by columns 1-10
$ gtfToGenePred -genePredExt -geneNameAsName2 -ignoreGroupsWithoutExons Mus_musculus.GRCm38.102.gtf /dev/stdout | \
    awk 'BEGIN { OFS="\t"} {print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' > Mus_musculus.GRCm38.102.gtf.refflat
# check the output
$ head Mus_musculus.GRCm38.102.gtf.refflat
4933401J01Rik	ENSMUST00000193812	1	+	3073252	3074322	3074322	3074322	1	3073252,	3074322,
Gm26206	ENSMUST00000082908	1	+	3102015	3102125	3102125	3102125	1	3102015,	3102125,
Xkr4	ENSMUST00000162897	1	-	3205900	3216344	3216344	3216344	2	3205900,3213608,	3207317,3216344,
Xkr4	ENSMUST00000159265	1	-	3206522	3215632	3215632	3215632	2	3206522,3213438,	3207317,3215632,
Xkr4	ENSMUST00000070533	1	-	3214481	3671498	3216021	3671348	3	3214481,3421701,3670551,	3216968,3421901,3671498,
Gm18956	ENSMUST00000192857	1	+	3252756	3253236	3253236	3253236	1	3252756,	3253236,
Gm37180	ENSMUST00000195335	1	-	3365730	3368549	3368549	3368549	1	3365730,	3368549,
Gm37363	ENSMUST00000192336	1	-	3375555	3377788	3377788	3377788	1	3375555,	3377788,
Gm37686	ENSMUST00000194099	1	-	3464976	3467285	3467285	3467285	1	3464976,	3467285,
Gm1992	ENSMUST00000161581	1	+	3466586	3513553	3513553	3513553	2	3466586,3513404,	3466687,3513553,
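
Before passing the file to Picard, it can be worth checking that every row has the 11 tab-separated columns shown above and that the exon count agrees with the exon start and end lists. The function below is a rough sketch of such a check; check_refflat is a hypothetical helper name and is not part of Picard or the UCSC utilities.

```shell
# check_refflat FILE: report rows that do not look like valid refFlat records.
# Returns non-zero if any problem is found. (Hypothetical helper.)
check_refflat() {
    awk -F'\t' '
        NF != 11 {
            printf "line %d: expected 11 fields, found %d\n", NR, NF
            bad++
            next
        }
        {
            # exonStarts ($10) and exonEnds ($11) are comma-terminated lists
            ns = split($10, s, ",")
            ne = split($11, e, ",")
            # drop the empty element produced by the trailing comma
            if (s[ns] == "") ns--
            if (e[ne] == "") ne--
            if (ns != $9 || ne != $9) {
                printf "line %d: exonCount=%s but %d starts and %d ends\n", NR, $9, ns, ne
                bad++
            }
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

For example, `check_refflat Mus_musculus.GRCm38.102.gtf.refflat && echo "refFlat looks consistent"` prints the message only when no row is flagged.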

Create ANNOVAR databases for non-human species

In the following example, we'll build an ANNOVAR database for Arabidopsis thaliana:

# download the GTF file and the genome FASTA file for this plant (both are listed at http://plants.ensembl.org/info/website/ftp/index.html) into a folder called atdb
$ mkdir atdb
$ cd atdb
$ wget ftp://ftp.ensemblgenomes.org/pub/release-27/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.27.dna.genome.fa.gz
$ wget ftp://ftp.ensemblgenomes.org/pub/release-27/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.27.gtf.gz

# decompress both files
$ gunzip Arabidopsis_thaliana.TAIR10.27.dna.genome.fa.gz
$ gunzip Arabidopsis_thaliana.TAIR10.27.gtf.gz

# use the gtfToGenePred tool to convert the GTF file to a genePred file
$ gtfToGenePred -genePredExt Arabidopsis_thaliana.TAIR10.27.gtf AT_refGene.txt

# generate a transcript FASTA file with the script provided by ANNOVAR
$ perl retrieve_seq_from_fasta.pl --format refGene --seqfile Arabidopsis_thaliana.TAIR10.27.dna.genome.fa AT_refGene.txt --out AT_refGeneMrna.fa

After this step, the annotation database files needed for gene-based annotation are ready, and you can annotate a VCF file. Note that the --buildver argument must be set to AT, matching the AT_ prefix of the AT_refGene.txt and AT_refGeneMrna.fa files.
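
The annotation step itself might look like the sketch below, which wraps ANNOVAR's table_annovar.pl in a small function. The wrapper name annotate_vcf, the DRY_RUN variable, and the example file names are assumptions made for illustration; the sketch also assumes table_annovar.pl is in the current directory and that the database directory contains the AT_ files.

```shell
# annotate_vcf VCF DBDIR OUTPREFIX
# Hypothetical wrapper: prints the command when DRY_RUN=1, otherwise runs it.
# Assumes table_annovar.pl is in the current directory and DBDIR holds
# AT_refGene.txt and AT_refGeneMrna.fa.
annotate_vcf() {
    vcf=$1; dbdir=$2; out=$3
    # gene-based annotation (operation g) against the refGene protocol
    set -- perl table_annovar.pl "$vcf" "$dbdir" \
        --buildver AT --out "$out" \
        --protocol refGene --operation g --vcfinput
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Running `DRY_RUN=1 annotate_vcf sample.vcf atdb/ sample_anno` prints the command without executing it, which is a convenient way to review the arguments first.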
