Add remainder mouse annotation file from uniprot upstreams to MGI data stream #329
Comments
Tagging @sierra-moxon @ukemi |
@leemdi noticed today that the current mouse GPAD file, http://snapshot.geneontology.org/annotations/mgi.gpad.gz, is missing all of the column 12 data. When it exists, these data should include information about Noctua models. The noctua_mgi.gpad: The same annotation from the above referenced file: Looks like the file format for the mgi.gpad file is out of date. |
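A quick way to confirm the gap is to count data rows whose twelfth column is empty or absent. This is only a minimal sketch: it assumes a tab-separated GPAD file with `!`-prefixed header lines and annotation properties in column 12, and the function names are hypothetical, not part of any GO pipeline.

```python
import gzip


def count_missing_properties(lines):
    """Return (total_rows, rows_with_no_column_12) for an iterable of GPAD lines."""
    total = 0
    missing = 0
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        total += 1
        # Column 12 is index 11; treat a short row or an empty cell as missing.
        if len(cols) < 12 or not cols[11].strip():
            missing += 1
    return total, missing


def count_missing_in_file(path):
    """Hypothetical convenience wrapper for a local, gzipped GPAD file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_missing_properties(handle)
```

Running `count_missing_in_file` on local copies of mgi.gpad.gz and noctua_mgi.gpad would give a side-by-side count of rows that carry no column 12 data.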
I don’t see noctua-model-id, etc. in mgi.gaf either. I only see them in noctua_mgi.gpad |
The gaf won't have the Noctua info. I'm not even sure it is possible to have it in there (I think in GPAD these days). But it looks like the mgi.gpad file that is on snapshot (above) needs quite a bit of work and therefore failed @leemdi 's tests. As @sierra-moxon works through this project, hopefully it will come up to snuff like the noctua_mgi.gpad file. |
@ukemi Yes, the GPAD coming from that direction would generally have received less attention, as it has not been an official exchange product. That said, the format utilization should be harmonized, insofar as the information exists in the first place, between the GAF-derived GPAD lines and the GO-CAM-derived GPAD lines. @sierra-moxon @dustine32 @ukemi, I think it's probably best to add another ticket to the project for GPAD output harmonization. I'm not sure who is best placed to tackle that; it has likely not been of much concern since Eric last worked on it some time ago. @sierra-moxon, if you wanted to take a look at it first, that might work; otherwise, we can bring on @dustine32. Does that sound right to both of you? |
sounds good. |
Note that the format should harmonize with the noctua_mgi.gpad file since this is the one we currently load. |
Fiddly note on that note: the format should be "correct" with respect to the spec, and that should be reflected in our outputs. That may not be a completely trivial task. |
Per feedback from @kltm: |
So two req'ts from David:
|
Summary from yesterday's call with @ukemi @sierra-moxon @kltm @leemdi
The GOC (@sierra-moxon ) will need to do the following processing:
-QC/Awareness
|
@sierra-moxon and @kltm I have examined the GOA isoform file a bit today and most of the identifiers in it are not in our GPI file, although they are associated with MGI genes, for example UniProtKB:A0A075B5J2 is associated with MGI:98596, but it is not in the GPI file. The header of the isoform file says: Looking at the annotations in this file, it looks like many of them are from TrEMBL and are not loaded into MGI but in many cases the gene already has a similar annotation in the standard gaf. is from an un-reviewed TrEMBL record and is not loaded into MGI, even though there is a sequence->gene association. We need to ask @leemdi, but I think we filter annotations to TrEMBL records when we load. @leemdi this might be in the GOA-mouse load or in the UniProt load since they both generate GO annotations. Does the GOC want to load information from un-reviewed TrEMBL records? If so, there are at least two strategies we could take:
|
@sierra-moxon it would be really helpful for me if at first you could process the goa_mouse_isoform.gaf and the goa_mouse.gaf separately. I'd like to be able to tell how many annotations from each file are filtered. |
A0A024QYR9 is not being loaded into MGI as a GO/Mouse annotation because it fell into "nopubmed.error" (fields 2 and 6: A0A024QYR9, GO_REF:0000003). |
Here is the Uberon -> EMAPA OBO file we use: https://purl.obolibrary.org/obo/uberon.obo It has an "xref:" for EMAPA terms. |
I just downloaded our most recent mouse gaf from the MGI reports page: I wanted to get a count of how many annotations were in the file that used UniProt identifiers as the primary annotation object (column 2), which means we couldn't map them to mouse genes. To my surprise, there were none. @leemdi, can you confirm that we still add the 'unloadable' annotations to the end of the MGI gaf? @sierra-moxon , if this is the case, it would add another filtering step for annotations where the UniProt identifier doesn't map to an MGI gene. |
The filtering step was confirmed on this morning's managers' call at the GOC level. If there are annotations that use a protein identifier and it cannot be mapped to a mouse gene, then we shouldn't include those in the annotation corpus. |
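A minimal sketch of that filtering step follows. It assumes GAF rows are tab-separated with the database in column 1 and the object ID in column 2. The `uniprot_to_mgi` lookup is a hypothetical dict (in practice it would presumably be derived from the GPI file), and the identifiers in the sample are made up. Each upstream file is processed separately so the number of filtered annotations can be reported per file.

```python
from collections import Counter


def filter_unmapped(gaf_lines, uniprot_to_mgi):
    """Keep only GAF rows whose column 1:column 2 identifier maps to a mouse gene.

    Returns (kept_rows, stats), where stats counts 'kept' and 'filtered' rows.
    """
    kept = []
    stats = Counter()
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        subject = f"{cols[0]}:{cols[1]}"
        if subject in uniprot_to_mgi:
            kept.append(line)
            stats["kept"] += 1
        else:
            stats["filtered"] += 1
    return kept, stats


# Hypothetical lookup and sample rows, for illustration only.
SAMPLE_MAPPING = {"UniProtKB:P00001": "MGI:MGI:0000001"}
SAMPLE_FILES = {
    "goa_mouse.gaf": ["!gaf-version: 2.2", "UniProtKB\tP00001\tsym", "UniProtKB\tP00002\tsym"],
    "goa_mouse_isoform.gaf": ["UniProtKB\tP00003\tsym"],
}

for file_name, rows in SAMPLE_FILES.items():
    _, file_stats = filter_unmapped(rows, SAMPLE_MAPPING)
    print(file_name, dict(file_stats))
```

Keeping one `Counter` per file makes it easy to see how many annotations each upstream file loses to the filter.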
I have examined the annotations that have GO-REFs from the main file that we import from UniProt (goa_mouse.gaf). Summary:
|
GOA GAF conversion results (two files b/c we process isoform file separately): https://drive.google.com/drive/folders/1pcMltYV_mKbIzPF19W4exKHd6TEsGLa- |
Thanks @sierra-moxon |
I take this back. It looks like these are legitimate annotations to isoforms. However, many are redundant with the non-isoform annotations. Redundancy rears its ugly head again. We would convert the IDs to PRO ids since that is what we use for protein representation in MGI and therefore is what is in our GPI file. For example UniProtKB:P54830-1=PR:P54830-1. I notice that you made the provider GO_Central. This is not the case for these annotations because they are manual annotations made by individual groups. They should retain the original source. This one: I think we will have to skip ones like this because they won't resolve to PRO identifiers, but I need to investigate this more: |
Thinking about the step where we replace UniProtKB:$$$$$$$ with PR:$$$$$$$ in the with field and the isoform field. It's not just a simple switch.
A second way to do this would be to use the GPI file. Since it represents 'truth', any UniProtKB identifier in the 'proteoform' or 'with' field that can be switched would be in the GPI file in lines that look like this: |
Notes from yesterday's call:
|
@LiNiMGI and I just manually spot-checked the isoform annotations file above and determined that we should include these annotations. We realize that some will not map to MGI annotatable objects such as UniProtKB:P55095-PRO_0000011280, but most will. So an annotation like this: Will look something like this in the final GPAD: Note I changed the provider to match what is in the original annotation: |
@sierra-moxon - the file we checked this morning is the current "goa mouse converted - isoform" file in Google Drive, which does not include the "with no dashes" ones? |
I think you should weed them out because I don't think they always represent an isoform and as we saw yesterday, we don't really know if they represent the gene or the protein because both are in our GPI. Hopefully @LiNiMGI agrees, but I think we should follow the strict rule about the GPI representing the valid annotation objects in MGI. However, I do wonder if those that don't have the dashes are represented by what is essentially a duplicate annotation in the non-isoform file. |
@sierra-moxon, if you want us to be doubly sure, we need some examples of non-dash ones to trace. Can you provide a couple examples of ones that are in the GPI but don't have dashes? |
We just found the file from yesterday. @LiNiMGI and I just looked at some of the non-dash ones and they are bona-fide isoforms. Line in GPI: |
So the new rule is not that we should only take "dashed" UniProt IDs, but that we need to check if the UniProt ID is associated with only a PRO id (those associated with an MGI as well, should be weeded out). Sorry to be pedantic 🥴 - does this new rule apply to the "dashed" UniProt ids as well? |
No problem! I apologize that I've managed to make this really confusing. Hopefully you can do this:
Here is one that maps to both in the GPI UniProtKB:Q8VHW3 (skip): PR:Q8VHW3 mCACNG6 voltage-dependent calcium channel gamma-6 subunit (mouse) mCACNG6|neuronal voltage-gated calcium channel gamma-6 subunit (mouse) PR:000000001 NCBITaxon:10090 MGI:MGI:1859168 UniProtKB:Q8VHW3 Here is one that maps to only one in the GPI UniProtKB:A0A087WRD7 (keep): |
Ticket opened for annotations that are missing in the new load, discovered by @leemdi. This is not on our end, but should be noted as part of the project. |
Hi @sierra-moxon. I can't remember now whether we left the 'validation' of annotations in the isoform file at only taking ones that had a hyphenated suffix. Today while looking at one of our QC reports, I found some isoforms that don't have a suffix. For example, UniProtKB:D3YX90 is in our GPI cross-referenced to what appears to be a legitimate isoform in PRO: PR:D3YX90 mADAMTS17 a disintegrin and metalloproteinase with thrombospondin motifs 17 (mouse) mADAMTS17 PR:000000001 NCBITaxon:10090 MGI:MGI:3588195 UniProtKB:D3YX90 It is not cross-referenced to an MGI marker/gene directly. At the end of the day, the best way to determine whether an entity is valid for annotation in MGI remains whether you can find it in the GPI file. The two files from the UniProt upstream would be processed differently: the non-isoform file would be checked for an x-ref to a gene (MGI:MGI:), and the isoform file would be checked against PR: identifiers as above. |
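The GPI-based check described above could be sketched roughly as follows. This is a reading of the thread, not the pipeline's actual code. It assumes the GPI is tab-separated with the object ID in the first column and UniProtKB cross-references somewhere in the later columns. The function names are hypothetical, the tab positions in the sample rows are guessed from the flattened examples, and the gene row for Q8VHW3 is invented to illustrate the "maps to both" case. For the isoform file, the sketch follows the keep/skip examples earlier in the thread: a UniProtKB ID is accepted only if it resolves to a PR: object and not directly to a gene.

```python
from collections import defaultdict


def index_gpi_by_uniprot(gpi_lines):
    """Map each UniProtKB xref to the set of GPI object IDs (column 1) citing it."""
    index = defaultdict(set)
    for line in gpi_lines:
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        object_id = fields[0]
        # Scan every later column; multi-valued cells are pipe-separated.
        for field in fields[1:]:
            for value in field.split("|"):
                if value.startswith("UniProtKB:"):
                    index[value].add(object_id)
    return index


def is_valid_subject(uniprot_id, source_file, index):
    """Apply the per-file rule: 'isoform' needs a PR-only match, otherwise a gene match."""
    objects = index.get(uniprot_id, set())
    has_gene = any(obj.startswith("MGI:") for obj in objects)
    has_pro = any(obj.startswith("PR:") for obj in objects)
    if source_file == "isoform":
        return has_pro and not has_gene
    return has_gene


# Illustrative rows only; column layout is an assumption.
SAMPLE_GPI = [
    "PR:D3YX90\tmADAMTS17\tname\tmADAMTS17\tPR:000000001\tNCBITaxon:10090\tMGI:MGI:3588195\tUniProtKB:D3YX90",
    "PR:Q8VHW3\tmCACNG6\tname\tmCACNG6\tPR:000000001\tNCBITaxon:10090\tMGI:MGI:1859168\tUniProtKB:Q8VHW3",
    # Hypothetical gene row that also cites UniProtKB:Q8VHW3.
    "MGI:MGI:1859168\tSYMBOL\tname\t\ttype\tNCBITaxon:10090\t\tUniProtKB:Q8VHW3",
]

SAMPLE_INDEX = index_gpi_by_uniprot(SAMPLE_GPI)
```

With this sample, `is_valid_subject("UniProtKB:D3YX90", "isoform", SAMPLE_INDEX)` accepts the suffix-less isoform, while UniProtKB:Q8VHW3 is rejected for the isoform file because it also resolves to a gene row.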
@LiNiMGI Can you please check whether these are being injected? |
@sierra-moxon @ukemi @leemdi
A. B. UniProtKB:O54824-PRO_0000015418 RO:0002331 GO:0019221 PMID:30089723 ECO:0000314 2021-04-30 ARUK-UCL BFO:0000066(UBERON:0002048) At the moment, MGI will filter those annotations out since UniProtKB:O54824-PRO_0000015418 is not in our GPI. C. |
A. - No, no report for skipped annotations in the preprocess pipeline (we do have the reports from the GORules). B.
in preprocess pipeline, I use the mouse GPI to map
then in the GPAD emission, discussed here, I replace the "subject.id" with the value of the isoform identifier:
|
Thanks @sierra-moxon, Li will find out whether we can get a PR:ID for them. Li |
Discussing with @LiNiMGI The task is to convert UniProt chains to PRO ID. Once PRO (Protein Ontology) IDs are available for these UniProt chains (UniProt:O####), then Li will convert the |
Add remainder mouse annotation file from uniprot upstreams to MGI data stream.
Listed as:
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/MOUSE/goa_mouse.gaf.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/MOUSE/goa_mouse_isoform.gaf.gz