gff3ToGenePred inGff3 outGp
This tool is part of UCSC Genome Browser's Utility Tools.
- -warnAndContinue: if a bad genePred is created, print a warning but continue
- -useName: rather than using 'id' as name, use the 'name' tag
- -rnaNameAttr=attr: If this attribute exists on an RNA record, use it as the genePred name column
- -geneNameAttr=attr: If this attribute exists on a gene record, use it as the genePred name2 column
- -attrsOut=file: output attributes of mRNA record to file. These are per-genePred row, not per-GFF3 record. They are derived from GFF3 attributes, not the attributes themselves.
- -processAllGeneChildren: output genePred for all children of a gene regardless of feature
- -unprocessedRootsOut=file: output GFF3 root records that were not used. This will not be a valid GFF3 file. It's expected that many non-root records will not be used and they are not reported.
- -bad=file: output genePreds that fail checks to file
- -maxParseErrors=50: Maximum number of parsing errors before aborting. A negative value will allow an unlimited number of errors. Default is 50.
- -maxConvertErrors=50: Maximum number of conversion errors before aborting. A negative value will allow an unlimited number of errors. Default is 50.
- -honorStartStopCodons: only set CDS start/stop status to complete if there are corresponding start_codon or stop_codon records
- -defaultCdsStatusToUnknown: default the CDS status to unknown rather than complete.
- -allowMinimalGenes: normally this program assumes that genes contain transcripts, which contain exons. If this option is specified, genes with exons as direct children and stand-alone genes with no exon or transcript children will be converted.
- -refseqHacks: enable various hacks to make RefSeq conversion work. This turns on -useName, -allowMinimalGenes, and -processAllGeneChildren. It also tries harder to find an accession in attributes.
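As a rough sketch of how several of the options above can be combined, the script below builds a command line that warns on bad genePreds instead of stopping, saves the failing records, and lifts the conversion error limit. The file names (`annotations.gff3`, `annotations.genePred`, `bad.genePred`) and the `build_cmd` helper are hypothetical, introduced only for illustration; the tool is run only if it is installed and the input exists.

```shell
#!/bin/sh
# Hypothetical input/output names; substitute your own files.
in_gff3="annotations.gff3"
out_gp="annotations.genePred"

# Print a command line that warns on bad genePreds instead of stopping,
# writes the failing ones to bad.genePred (hypothetical name), and allows
# an unlimited number of conversion errors (negative value).
build_cmd() {
    echo "gff3ToGenePred -warnAndContinue -bad=bad.genePred -maxConvertErrors=-1 $1 $2"
}

cmd=$(build_cmd "$in_gff3" "$out_gp")
echo "$cmd"

# Run it only when the tool and the input file are actually present.
# $cmd is left unquoted on purpose so it splits into words; this assumes
# the file names contain no spaces.
if command -v gff3ToGenePred >/dev/null 2>&1 && [ -f "$in_gff3" ]; then
    $cmd
fi
```

Records that fail the checks would then be collected in the file given to -bad rather than aborting the run.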
gff3ToGenePred converts the following records in a gff3 file:
- top-level gene records with RNA records
- top-level RNA records
- RNA records that contain:
  - exon and CDS
  - CDS, five_prime_UTR, three_prime_UTR
  - only exon for non-coding
- top-level gene records with transcript records
- top-level transcript records
- transcript records that contain
where RNA can be mRNA, ncRNA, or rRNA, and transcript can be either transcript or primary_transcript.
The first step is to parse the GFF3 file; by default, up to 50 parse errors are reported before aborting (see -maxParseErrors). If the GFF3 file is parsed successfully, it is converted to gene annotations. By default, up to 50 conversion errors are reported before aborting (see -maxConvertErrors).
Input file must conform to the GFF3 specification: http://www.sequenceontology.org/gff3.shtml
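To make the gene, RNA, exon/CDS hierarchy described above concrete, here is a minimal sketch that writes a small GFF3 file and checks its column count. All identifiers and coordinates (`gene1`, `rna1`, `chr1`, the source name `example`, the file name `minimal.gff3`) are made up for illustration, and no particular genePred output is claimed; the conversion step runs only if gff3ToGenePred is installed.

```shell
#!/bin/sh
# Write one tab-separated GFF3 row (9 columns).
row() { printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$@"; }

# Hypothetical gene with one mRNA, two exons, and a two-part CDS.
# Children point at their parent through the Parent attribute.
{
    echo '##gff-version 3'
    row chr1 example gene 1000 2000 . + . 'ID=gene1;Name=GENE1'
    row chr1 example mRNA 1000 2000 . + . 'ID=rna1;Parent=gene1'
    row chr1 example exon 1000 1200 . + . 'ID=exon1;Parent=rna1'
    row chr1 example exon 1500 2000 . + . 'ID=exon2;Parent=rna1'
    row chr1 example CDS  1101 1200 . + 0 'ID=cds1;Parent=rna1'
    row chr1 example CDS  1500 1900 . + 2 'ID=cds1;Parent=rna1'
} > minimal.gff3

# Sanity check: every non-comment line must have exactly 9 columns.
if awk -F'\t' '!/^#/ && NF != 9 { bad = 1 } END { exit (bad + 0) }' minimal.gff3; then
    echo "minimal.gff3: column count OK"
fi

# Convert only if the tool is installed.
if command -v gff3ToGenePred >/dev/null 2>&1; then
    gff3ToGenePred minimal.gff3 minimal.genePred || echo "conversion reported errors" >&2
fi
```

This corresponds to the "top-level gene records with RNA records" case, with the RNA record containing exon and CDS children.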
Convert GENCODE long non-coding RNA annotations (in GFF3 format) to genePred format:
# download annotations in gff3 format
# convert GFF3 to genePred, making sure to include -geneNameAttr=gene_name
# so that gene symbol is used as the name2 instead of ID number, and sorting by chromosome and position:
gff3ToGenePred -geneNameAttr=gene_name gencode.v32.long_noncoding_RNAs.gff3.gz stdout | sort -k2,2 -k4n,4n > gencode.v32.lncRNAs.genePred
You can then use genePredToBed to convert the annotations into standard BED format.
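The sort keys used in the command above order rows by chromosome (column 2) and then numerically by transcript start (column 4). The sketch below shows that behavior on three made-up rows in the basic 10-column genePred layout; the row values and the file names `demo.genePred` and `demo.bed` are assumptions for illustration only. The genePredToBed step runs only if that tool is installed.

```shell
#!/bin/sh
# Three made-up genePred rows, deliberately unsorted.
# Columns: name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds
# printf reuses the 10-field format for each group of 10 arguments.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    tB chr2 + 500  900  900  900  1 500,  900, \
    tA chr1 + 3000 3400 3400 3400 1 3000, 3400, \
    tC chr1 - 200  700  700  700  1 200,  700, \
    > demo.genePred

# Same keys as the example: chromosome first, then numeric start position.
# LC_ALL=C makes the ordering independent of the local collation settings.
sorted_names=$(LC_ALL=C sort -k2,2 -k4n,4n demo.genePred | cut -f1)
echo "$sorted_names"

# Optional follow-up conversion, only if the tool is installed.
if command -v genePredToBed >/dev/null 2>&1; then
    genePredToBed demo.genePred demo.bed || echo "genePredToBed reported errors" >&2
fi
```

Here the two chr1 rows come first, ordered by start position (tC, then tA), followed by the chr2 row (tB).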