Gene Feature Enumeration

sjmack edited this page Feb 21, 2015 · 2 revisions

Gene Feature Enumeration (GFE)

A proposal has been made for a system of enumerated gene features (untranslated regions [UTRs], exons and introns) as an extension of the HLA allele nomenclature (http://biorxiv.org/content/early/2015/02/15/015222).

We expanded and refined the elements of the original GFE proposal, as summarized in GFE_update_02202015.pdf.

DaSH II Revisions

  • Change GFE notation for partial sequences from a decimal (e.g., 8.443) enumeration to a separate enumeration of partial sequences denoted with p, for 'partial' (e.g., p1, p2, p3). A partial sequence is defined as a sequence that is not full-length for a given feature due to a limitation of the typing methodology (e.g., different primer locations). Since a partial sequence can potentially match multiple full-length feature sequences, it may not be valid to identify a given partial sequence as a short version of a particular full-length feature.
  • Treat unavailable/untyped/untested sequence for a feature as a partial sequence, and denote it as p0. Essentially, an unavailable sequence is a potential match to all full-length feature sequences.
  • Treat indels as sequence variants and enumerate them as full sequences; these sequences are not full-length for a given feature because of biological variation rather than a limitation of the typing methodology.
  • Similarly treat deleted features as legitimate sequence variants and enumerate them as full sequences.
  • Treat duplications of sequence features (e.g., two intron 1 (i1) and exon 2 (e2) sequences) in a single gene as nucleotide variants of the second duplicated feature; see GFE_update_02202015.pdf. If i1 and e2 are duplicated (e.g., 5'UTRe1i1e2i1e23'UTR), treat the second i1~e2 as part of the sequence of the first e2. This maintains the field structure for each gene.
  • Change the delimiter from colons (:) to semi-colons (;) to further distinguish GFE notation from allele names.
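
The revisions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`feature_field`, `join_fields`, `fold_duplicated_features`) are hypothetical, the way fields are ordered and prefixed in a complete GFE notation is not specified here, and folding a duplicated i1~e2 into the first e2 by simple concatenation is one possible reading of the duplication rule.

```python
from typing import List, Optional, Tuple

DELIMITER = ";"  # DaSH II revision: semicolons instead of colons


def feature_field(number: Optional[int], partial: bool = False) -> str:
    """Render the enumeration for one feature (hypothetical helper).

    None means the feature was unavailable/untyped/untested -> "p0".
    Partial sequences get their own enumeration: "p1", "p2", ...
    """
    if number is None:
        return "p0"
    if number < 1:
        raise ValueError("enumerations start at 1; p0 is reserved for untyped features")
    return f"p{number}" if partial else str(number)


def join_fields(fields: List[str]) -> str:
    """Join per-feature fields with the semicolon delimiter."""
    return DELIMITER.join(fields)


def fold_duplicated_features(features: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Fold repeated features into the preceding retained feature.

    Assumption: a repeated feature name (e.g., a second i1 or e2) is appended
    to the sequence of the last feature kept, so the field count is unchanged.
    """
    folded: List[Tuple[str, str]] = []
    seen = set()
    for name, sequence in features:
        if name in seen and folded:
            last_name, last_sequence = folded[-1]
            folded[-1] = (last_name, last_sequence + sequence)
        else:
            seen.add(name)
            folded.append((name, sequence))
    return folded
```

For example, `join_fields([feature_field(None), feature_field(3), feature_field(2, partial=True)])` yields `p0;3;p2`, and folding the feature order 5'UTR, e1, i1, e2, i1, e2, 3'UTR leaves five fields, with the second i1 and e2 absorbed into the first e2.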

Considerations for a GFE Service

We also discussed ways to implement an effective GFE service, and the apparent obstacles to such a service.

  • It is not clear how the respective 5' and 3' ends of the 5' and 3' UTRs are defined in the IMGT/HLA Database. The basis of these definitions needs to be clarified in order to define a full-length UTR sequence.
  • To distinguish short feature sequences that are legitimate length variants from partial sequences, the service will need to inspect short sequences for indels by comparing them to a reference sequence.
  • To persist enumerations (and therefore GFE notations) across IMGT/HLA Database releases, all numbered full-length and partial GFEs should first be re-evaluated against the new database annotations. Only after all extant enumerations have been evaluated should new, extended or deleted sequences in that release be evaluated and assigned new (higher-numbered) full-length and partial enumerations.
  • It would be effective to hash each feature sequence, and then enumerate each unique hashed sequence.
  • Each hashed sequence feature, and its associated enumeration, should be maintained in the GFE service, even if it appears to have been superseded by a change in the reference database.
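
A minimal sketch of the hash-then-enumerate idea follows, assuming SHA-256 as the hash and case-insensitive sequence comparison; neither choice is specified above. The `FeatureRegistry` class and its `assign` method are hypothetical names. Deciding whether a sequence is partial (for example, by checking for indels against a reference) is left to the caller.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Optional, Tuple


class FeatureRegistry:
    """Hypothetical in-memory store of hashed feature sequences."""

    def __init__(self) -> None:
        # (feature, partial flag, digest) -> enumeration; entries are never
        # deleted, so a number stays valid even if the reference changes.
        self._numbers: Dict[Tuple[str, bool, str], int] = {}
        # (feature, partial flag) -> highest number assigned so far
        self._last: Dict[Tuple[str, bool], int] = defaultdict(int)

    @staticmethod
    def digest(sequence: str) -> str:
        # Assumption: sequences are compared case-insensitively.
        return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()

    def assign(self, feature: str, sequence: Optional[str], partial: bool = False) -> str:
        """Return the enumeration for a sequence, creating one if needed."""
        if sequence is None:
            return "p0"  # unavailable/untyped/untested feature
        key = (feature, partial, self.digest(sequence))
        if key not in self._numbers:
            # New sequences always receive the next higher number.
            self._last[(feature, partial)] += 1
            self._numbers[key] = self._last[(feature, partial)]
        number = self._numbers[key]
        return f"p{number}" if partial else str(number)


registry = FeatureRegistry()
print(registry.assign("e2", "ACGT"))                # first full-length e2
print(registry.assign("e2", "ACG", partial=True))   # first partial e2
print(registry.assign("e2", None))                  # untyped e2
```

Note that an empty string is treated here as a legitimate (deleted) feature variant and enumerated as a full sequence, whereas `None` marks an untyped feature. Because existing entries are looked up before any new number is issued, processing extant sequences first and new ones afterwards keeps earlier enumerations stable across releases.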
