Update OMIM gene references #8624

Merged
merged 5 commits into from
Jan 30, 2025

Conversation

twhetzel
Collaborator

Update OMIM gene references.

@twhetzel twhetzel requested a review from matentzn January 24, 2025 16:52
@@ -143773,8 +143773,8 @@ intersection_of: MONDO:0015281 ! atrial standstill
intersection_of: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/10593 ! SCN5A
intersection_of: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/4279 ! GJA5
relationship: excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql
-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/10593 {source="OMIM:108770"} ! SCN5A
-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/4279 {source="MONDO:mim2gene_medgen", source="OMIM:108770"} ! GJA5
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/10593 ! SCN5A
Member

I thought we decided not to preserve gene references if there is no provenance at all? This applies to various examples in this PR.

Collaborator Author

@twhetzel twhetzel Jan 27, 2025

This gene association is for MONDO:0007171 'atrial standstill 1'.

OMIM:108770 has "digenic" in the disease description. @joeflack4 I thought those were being filtered out and added into the "review.tsv" file in the OMIM repo, is that not the case in general?

I am also not finding this association in the data file from the 2025-01-19 OMIM ingest release https://github.com/monarch-initiative/omim/releases/tag/2025-01-19.

Collaborator Author

We did note that any updates where there is >1 subclassof gene association will need to be reviewed manually

Collaborator Author

In this case it seems like both the gene associations added as a logical definition and a subclassOf relation should be removed.

Collaborator Author

@joeflack4 after thinking a bit more about this, can you remind me:
(1) if any disease description that includes "digenic" is added into the review.tsv file
(2) if it is also added into the ROBOT file in order to create the gene association
(3) how confirmed disease defining "digenic" like https://omim.org/entry/601067 are handled

If this is in the docs, feel free to point me to that first and I can then add any follow-up questions.

Collaborator

I can start looking into this in a couple hours. For right now, just wanted to let you know that the documentation for the review.tsv is on the omim repo readme towards the bottom.

If something is not in the January 19th data files and the Mondo ingest build is newer than that, then that is problematic.

For digenic, we did for a short time have a special rule where we added associations when "digenic" was in the label, even if there was more than one association. It was an exception rule, but we removed that exception. However, we do not have any explicit filtering. So if something is, perhaps erroneously, labeled as digenic, but has only one association and meets the other logical conditions for a disease-defining gene, then the association will be created. That's my guess as to why this one is appearing here, but I will look into it further.

Collaborator

@joeflack4 joeflack4 Jan 28, 2025

RE: OMIM:108770

So, we had to exclude this case specifically. Because, even though it is marked "digenic", it meets all of the conditions for a disease-defining gene (including 1 entry in morbidmap.txt, which is unusual for one marked 'digenic').

Thus, this is why it is not appearing in the data release, as Trish mentioned.

And, if you look at the diff highlighted in this thread, you can see that the OMIM source evidence indeed is removed. This is correct. However, relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/10593 still remains, even without any source evidence. So this does not appear to be a bug with omim or mondo-ingest, but with the mondo pipeline. Perhaps it needs to be adjusted such that if there is no evidence for the gene association, it is removed?

I didn't work on this pipeline previously but if you want I can tweak the pipeline.
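The adjustment suggested above could look roughly like the following. This is only a sketch of the idea, not the actual Mondo pipeline code: the function name `drop_unsourced_gene_associations` and the regular expression are assumptions made for illustration, and the two example lines are the SMCHD1 and DUX4 lines shown in this PR's diff.

```python
import re

# Prefix of the OBO relationship lines discussed in this thread.
GENE_REL = "relationship: has_material_basis_in_germline_mutation_in "
# Matches one source="..." qualifier inside the {...} annotation block.
SOURCE_RE = re.compile(r'source="([^"]+)"')


def drop_unsourced_gene_associations(stanza_lines):
    """Return the stanza lines minus gene associations with no source at all.

    Hypothetical helper: it only inspects the text of each line and does not
    parse OBO properly.
    """
    kept = []
    for line in stanza_lines:
        if line.startswith(GENE_REL) and not SOURCE_RE.search(line):
            continue  # no provenance left, so the association is dropped
        kept.append(line)
    return kept


stanza = [
    'relationship: has_material_basis_in_germline_mutation_in '
    'http://identifiers.org/hgnc/29090 {source="MONDO:mim2gene_medgen"} ! SMCHD1',
    'relationship: has_material_basis_in_germline_mutation_in '
    'http://identifiers.org/hgnc/50800 ! DUX4',
]
cleaned = drop_unsourced_gene_associations(stanza)
```

In this sketch the DUX4 line is dropped because it has no `source` qualifier, while the SMCHD1 line is kept. Any terms that curators want left untouched would need separate handling.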

Collaborator

@joeflack4 joeflack4 Jan 29, 2025

(1) if any disease description that includes "digenic" is added into the review.tsv file

Answer: Yes

Just to clarify, by "description", what you mean is the Phenotype label in the "Phenotype-Gene" table on the OMIM entry page, or the title in the "Phenotype" column of morbidmap.txt.

So, what's added to the review.tsv for 'digenic' is described here.

Basically, if it's marked digenic but has >1 association, we filter it out. It won't appear in the review.tsv. If it is marked digenic but has only 1 association and meets the other requirements for disease-defining (no [, {, or ?, not in our explicit exclusions, and mapping key = 3), then we do add it as a disease-defining association AND we add an entry for it in review.tsv.

(2) if it is also added into the ROBOT file in order to create the gene association

Answer: Yes (described more above)

(3) how confirmed disease defining "digenic" like https://omim.org/entry/601067 are handled

There is no logic for "confirmed" cases of disease-defining associations (digenic or otherwise). We only have exclusions. How the exclusions work is that even if something otherwise meets the conditions for disease-defining, it will be excluded if it appears in that TSV.
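The rules in answers (1) to (3) can be summarized in a small sketch. This is not the actual OMIM ingest code: `Decision` and `classify` are hypothetical names, the labels in the usage lines are made up, and two points are assumptions rather than statements from this thread: that a non-digenic phenotype also needs exactly one association, and that an excluded entry gets no review.tsv row.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    add_association: bool  # create the disease-defining gene association
    add_to_review: bool    # also write a row to review.tsv


def classify(label: str, mapping_key: int, n_associations: int, excluded: bool) -> Decision:
    """Sketch of the disease-defining rules described above (hypothetical helper)."""
    # "[", "{" and "?" in the phenotype label disqualify the entry.
    flagged = any(ch in label for ch in "[{?")
    meets_base = (not flagged) and mapping_key == 3 and not excluded
    is_digenic = "digenic" in label.lower()
    # Assumption: exactly one association is required, digenic or not.
    if not meets_base or n_associations != 1:
        return Decision(add_association=False, add_to_review=False)
    # Digenic entries with a single association are added AND flagged for review.
    return Decision(add_association=True, add_to_review=is_digenic)


single_digenic = classify("Example disease, digenic", 3, 1, excluded=False)
multi_digenic = classify("Example disease, digenic", 3, 2, excluded=False)
key_four = classify("Example disease", 4, 1, excluded=False)
in_exclusions = classify("Example disease", 3, 1, excluded=True)
```

Only `single_digenic` yields an association here, and it is also marked for review. The other three calls produce no association.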

Collaborator Author

Since MONDO:0007171 is in the exclusion file a gene association was not intended to be added. It has now been manually removed.

-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/29090 {source="MONDO:mim2gene_medgen", source="OMIM:158901"} ! SMCHD1
-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/50800 {source="OMIM:158901"} ! DUX4
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/29090 {source="MONDO:mim2gene_medgen"} ! SMCHD1
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/50800 ! DUX4
Collaborator Author

@twhetzel twhetzel Jan 27, 2025

These gene associations are for MONDO:0008031 'facioscapulohumeral muscular dystrophy 2'.

OMIM:158901 also has "digenic" in the disease description.

It seems these gene associations should also be removed.

Collaborator

See my tentative explanation above #8624 (comment)

If it is the case that these are, perhaps erroneously labeled as digenic, but otherwise meet all of the logical conditions for which we would normally add a disease defining association, do you want me to add an explicit filter to remove anything that happens to have digenic in the label?

Collaborator Author

Since MONDO:0008031 is in the exclusion file a gene association was not intended to be added. It has now been manually removed.

@@ -270747,7 +270751,7 @@ xref: UMLS:C3693482 {source="MEDGEN:811326", source="MONDO:equivalentTo", source
is_a: MONDO:0000653 {source="MONDO:Redundant", source="MONDO:indirect"} ! integumentary system cancer
is_a: MONDO:0005164 {source="DOID:3507"} ! fibrosarcoma
relationship: excluded_subClassOf MONDO:0019300 {source="Orphanet:31112", source="https://orcid.org/0000-0001-5208-3432"} ! obsolete rare skin tumor or hamartoma
-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/8800 {source="MONDO:mim2gene_medgen", source="OMIM:607907"} ! PDGFB
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/8800 {source="MONDO:mim2gene_medgen"} ! PDGFB
Collaborator Author

@twhetzel twhetzel Jan 27, 2025

This is for MONDO:0011934 'dermatofibrosarcoma protuberans'. Based on OMIM:607907 and that the mapping key is 4, this gene association should also be removed.

Collaborator

Evidence was removed, but the mondo pipeline likely needs to be updated now (see: thoughts).

Collaborator Author

@twhetzel twhetzel Jan 29, 2025

Ok, so the OMIM evidence was removed because this has a mapping key of 4 and does not meet the guidelines for disease-defining. However, the gene association also has source="MONDO:mim2gene_medgen", and the OMIM pipeline would not have removed the association even if there were no source provenance left, so this gene association currently remains as a result of the pipeline.

This gene association was manually removed.

@@ -326240,7 +326244,7 @@ xref: Orphanet:397959 {source="MONDO:equivalentTo", source="OMIM:615387"}
xref: UMLS:C3809332 {source="MONDO:equivalentTo", source="MONDO:MEDGEN", source="MEDGEN:815662"}
is_a: MONDO:0018814 {source="Orphanet:397959", source="https://orcid.org/0000-0001-5208-3432"} ! non-SCID combined immunodeficiency
relationship: curated_content_resource https://search.clinicalgenome.org/kb/conditions/MONDO:0014160 {source="MONDO:CLINGEN"}
-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/12029 {source="MONDO:mim2gene_medgen", source="OMIM:615387"} ! TRAC
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/12029 {source="MONDO:mim2gene_medgen"} ! TRAC
Collaborator Author

@twhetzel twhetzel Jan 27, 2025

Based on https://omim.org/entry/614102, I don't see why the OMIM gene was not updated to IGKC. @joeflack4 do you see why the gene association was not changed to IGKC with the source OMIM:614102?

Collaborator

@joeflack4 joeflack4 Jan 29, 2025

Well, firstly, the gene-disease association evidence from OMIM was removed in this case.

If you were asking why the label TRAC wasn't changed to IGKC

I don't think this is what you were asking, but:

When you say "changed to IGKC", do you mean that you would expect an association to the HGNC class labeled "IGKC" to be added (to this Mondo term, or another one)?
Or that you would expect the label ! TRAC to be changed to ! IGKC in mondo-edit.obo on this highlighted line, has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/12029?

If the latter, then I think that this is not a function of the mondo omim-genes pipeline, which uses omim-gene-equivalence.ru. It doesn't look like the pipeline has any functionality to update the labels for these associations.

Note that the evidence you have highlighted here was for OMIM:615387, not OMIM:614102.

MONDO:0013576 has relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/5716 {source="MONDO:mim2gene_medgen"} ! IGKC. MONDO:0014160 has relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/12029 {source="MONDO:mim2gene_medgen"} ! TRAC

It took me a bit of time to see why there is no OMIM source evidence being added for IGKC.

The reason is that mim2gene.txt doesn't show a relationship to HGNC:5716 (or any HGNC symbol or ID) for the gene (OMIM:147200) mapped to OMIM:614102. It only has an NCBI gene entry:

MIM Number | MIM Entry Type (see FAQ 1.3 at https://omim.org/help/faq) | Entrez Gene ID (NCBI) | Approved Gene Symbol (HGNC) | Ensembl Gene ID (Ensembl)
147200 | gene | 3514 | (blank) | ENSG00000211592
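A quick way to spot rows like this one is to parse the line and check the symbol column. The sketch below assumes mim2gene.txt is tab-delimited with the five columns listed above. The column keys and the `parse_mim2gene_row` helper are hypothetical names, not part of the OMIM ingest.

```python
# Hypothetical short keys for the five mim2gene.txt columns listed above.
COLUMNS = ["mim_number", "entry_type", "entrez_gene_id", "hgnc_symbol", "ensembl_gene_id"]


def parse_mim2gene_row(line: str) -> dict:
    """Split one tab-delimited row into a dict, padding missing trailing columns."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(COLUMNS) - len(values))
    return dict(zip(COLUMNS, values))


# The row for OMIM:147200, with an empty Approved Gene Symbol (HGNC) field.
row = parse_mim2gene_row("147200\tgene\t3514\t\tENSG00000211592")
has_hgnc_symbol = bool(row["hgnc_symbol"])
```

Here `has_hgnc_symbol` is false even though the Entrez and Ensembl IDs are present, which is the gap described in this comment.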

For reference, the morbidmap.txt mapping:
Kappa light chain deficiency, 614102 (3) IGKC, IGKCD 147200 2p11.2

Note that you can see "IGKC" in that row, but it is under the "Gene/Locus And Other Related Symbols" column, which we do not parse.

This does result in the following entry added to omim.owl: AnnotationAssertion(<http://www.w3.org/2004/02/skos/core#exactMatch> <https://omim.org/entry/147200> <https://www.ncbi.nlm.nih.gov/gene/3514>)

However, we don't have any mappings between NCBI genes and HGNC IDs in OMIM. So this doesn't get captured by the omim-genes pipeline. See mondo-omim-genes.sparql:

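# PREFIX declarations are assumed here; the excerpt in the original comment omits them.
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX RO: <http://purl.obolibrary.org/obo/RO_>

SELECT DISTINCT ?mondo_id ?hgnc_id ?omim_disease_xref ?omim_gene
WHERE
{
  ?omim_disease a owl:Class .
  ?omim_disease skos:exactMatch ?mondo_id .
  ?omim_disease rdfs:subClassOf [
        owl:onProperty RO:0004003 ;
        owl:someValuesFrom ?omim_gene
  ] .
  ?omim_gene skos:exactMatch ?hgnc_id .
  # Keep only exact matches that are HGNC IRIs.
  FILTER(STRSTARTS(STR(?hgnc_id), "http://identifiers.org/hgnc/"))
  # The original excerpt stops here; the closing brace is added so the fragment parses.
}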

All of the conditions are met here, except that ?hgnc_id doesn't start with "http://identifiers.org/hgnc/". So it does not get added to mondo-omim-genes.robot.tsv. And therefore it doesn't get added to mondo-edit.obo.
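The effect of that FILTER can be mimicked in a few lines of Python. This is an illustration only: `hgnc_matches` is a hypothetical helper, and the second pair uses a placeholder OMIM IRI that does not come from omim.owl.

```python
HGNC_PREFIX = "http://identifiers.org/hgnc/"


def hgnc_matches(exact_matches):
    """Keep (omim_gene, match) pairs whose match IRI starts with the HGNC prefix.

    Mirrors FILTER(STRSTARTS(STR(?hgnc_id), "http://identifiers.org/hgnc/")).
    """
    return [(gene, match) for gene, match in exact_matches if match.startswith(HGNC_PREFIX)]


pairs = [
    # From omim.owl as quoted above: only an NCBI gene match exists for 147200.
    ("https://omim.org/entry/147200", "https://www.ncbi.nlm.nih.gov/gene/3514"),
    # Placeholder gene entry with an HGNC match, for contrast (hypothetical IRI).
    ("https://omim.org/entry/000000", "http://identifiers.org/hgnc/10593"),
]
kept = hgnc_matches(pairs)
```

Only the placeholder pair survives, so OMIM:147200 never reaches mondo-omim-genes.robot.tsv in this sketch.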

Collaborator Author

@twhetzel twhetzel Jan 29, 2025

NOTE: The comment about IGKC refers to this line for MONDO:0013576 and there is a HGNC identifier for this gene: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:5716

NOTE: The gene association on this line should be maintained and with the OMIM source evidence. The OMIM source provenance was removed because of the missing data mapping in the mim2gene.txt file. Joe will email OMIM to let them know they are missing the HGNC gene identifier in the mim2gene.txt file.

Collaborator Author

@twhetzel twhetzel Jan 29, 2025

NOTE: The gene association for MONDO:0014160 on this line should have also been maintained. The OMIM source provenance was not added back here because of missing data for the HGNC identifier in the mim2gene.txt file.

Collaborator

The issue w/ the data that Trish and I are discussing is that we see a label for the HGNC term in morbidmap.txt, but the ID for that term does not appear in mim2gene.txt. We will contact OMIM about this apparent discrepancy.

Collaborator Author

We also discussed a few programmatic solutions to this issue. Given the status of the project, I would like to get feedback on how to proceed.

Collaborator Author

The OMIM source provenance is now added back for these gene associations.

Collaborator

For these instances, I also sent an email to OMIM:

We're having an issue where we are processing mim2gene.txt and it appears there are missing HGNC symbols in the Approved Gene Symbol (HGNC) column. We are expecting to see them there because the gene has an entry in morbidmap.txt where the symbol appears.

So we're just thinking that this is a consistency problem. We are wondering if you agree; that if symbols appear in one of the files for an association, they should also appear in the other file.

I created a GitHub issue for Monarch purposes, but wanted to share it with you because it goes over an example in detail: OMIM:147200 <-> HGNC:5716 (IGKC).

There are other examples we found which also fit the same discrepancy / pattern:

  • OMIM:615387 <-> HGNC:12029 (TRAC)
  • OMIM:601495 <-> HGNC:5541 (IGHM)

And their response:

TRAC, IGKC, and IGHM are immunoglobulin gene/regions. As such,
they are unusual and do not carry a standard "gene" locus type
annotation in the browsers. We do not always have a 1-to-1 correlation
with HGNC for the immunoglobulin entities. I will review these cases
and see if we can match a few more.

Of note, MIM:615387 and MIM:601495 shouldn't map to any HGNC ID.
They are phenotypes, not genes.

@@ -465430,7 +465434,7 @@ is_a: MONDO:0015977 {source="MESH:C538056", source="MONDO:Redundant", source="OM
intersection_of: MONDO:0011096 ! autosomal agammaglobulinemia
intersection_of: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/5541 ! IGHM
relationship: curated_content_resource https://search.clinicalgenome.org/kb/conditions/MONDO:0020729 {source="MONDO:CLINGEN"}
-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/5541 {source="MONDO:mim2gene_medgen", source="OMIM:601495"} ! IGHM
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/5541 {source="MONDO:mim2gene_medgen"} ! IGHM
Collaborator Author

It seems this OMIM source should have been maintained. @joeflack4 can you look into this?

Collaborator

So, the phenotype OMIM:601495 is mapped to the gene OMIM:147020 as shown in morbidmap.txt:

Agammaglobulinemia 1, 601495 (3) IGHM, MU, AGM1 147020 14q32.33

But as with my long explanation above, the reason we don't have any HGNC source evidence being added is because it only shows an NCBI gene entry in mim2gene.txt, not an HGNC one:

mim2gene.txt:
147020 gene 3507 ENSG00000211899

Which gets added like this in omim.owl:
AnnotationAssertion(<http://www.w3.org/2004/02/skos/core#exactMatch> <https://omim.org/entry/147020> <https://www.ncbi.nlm.nih.gov/gene/3507>)

Collaborator Author

@twhetzel twhetzel Jan 29, 2025

Based on the issue with the data, the OMIM source provenance for MONDO:0020729 on this line should have also been kept.

Collaborator

The issue w/ the data that Trish and I are discussing is that we see a label for the HGNC term in morbidmap.txt, but the ID for that term does not appear in mim2gene.txt. We will contact OMIM about this apparent discrepancy.

Collaborator Author

The OMIM source provenance was added back to this gene association.

-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/2979 {source="OMIM:619478"} ! DNMT3B
-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/50800 {source="OMIM:619478"} ! DUX4
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/2979 ! DNMT3B
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/50800 ! DUX4
Collaborator Author

This disease contains "digenic" in the description and the gene associations should be removed (not just the source provenance).

Collaborator

Per explanation here, just because it has digenic in the label doesn't necessarily mean it will be removed. Though in most cases it will. However it'll be removed because there are >1 associations, not because 'digenic' is in the label.

Regarding removing the entire association, this is part of the mondo pipeline that Nico created. I assume you or I will edit this to remove the association when no evidence remains? I can do that if you want.

Collaborator Author

MONDO:0030355 is in the exclusion file so these gene associations should not have existed in mondo-edit.obo. These gene associations should be removed.

Collaborator Author

The gene association was removed.

-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/253 {source="OMIM:619151"} ! ADH5
-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/404 {source="OMIM:619151"} ! ALDH2
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/253 ! ADH5
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/404 ! ALDH2
Collaborator Author

@twhetzel twhetzel Jan 27, 2025

The disease description contains "digenic" for OMIM:619151 and these gene associations should be removed.

Collaborator

See: response

Collaborator Author

@twhetzel twhetzel Jan 29, 2025

This (MONDO:0030894) is in the exclusion file and the gene association should be removed.

-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/12441 {source="OMIM:620040"} ! TYMS
-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/30365 {source="OMIM:620040"} ! ENOSF1
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/12441 ! TYMS
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/30365 ! ENOSF1
Collaborator Author

The disease description contains "digenic" for https://omim.org/entry/620040 and these gene associations should be removed.

Collaborator

See: response

Collaborator Author

@twhetzel twhetzel Jan 29, 2025

This (MONDO:0031057) is in the exclusion file and the gene association should be removed.

@twhetzel
Collaborator Author

This PR is ready for re-review. Additional tickets based on the issues found here will be added.

relationship: excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/10593 {source="OMIM:108770"} ! SCN5A
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/4279 {source="MONDO:mim2gene_medgen", source="OMIM:108770"} ! GJA5
Collaborator Author

@matentzn since these gene associations do not exist in master, do you know how the OMIM pipeline added these in?

Member

If they are not in master, it means this branch is behind master and needs to be rebased.

Collaborator Author

I did a local test last night with a fresh pull from mondo master and created a new branch and ran the OMIM gene update as sh run.sh make update-omim-genes -B and the same thing happened.

Screenshot 2025-01-30 at 8 07 49 AM

Collaborator

This term has the annotation
excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql

These gene annotations should not be removed.

Collaborator Author

@sabrinatoro can you remind me what your expectation is for entries that you said should be in the exclusion file? I thought these were Mondo classes where OMIM source provenance should not be added and therefore the gene association(s) should not be on the class.

Collaborator Author

@matentzn you mentioned in the first review "I thought we decided not to preserve gene references if there is no provenance at all? This goes for various examples in this PR."

Collaborator Author

@sabrinatoro the exclusion file is from the spreadsheet that you created and mentioned in the OMIM gene update in Nov.

That PR does not have any gene associations for MONDO:0015281 'atrial standstill'.

Collaborator

That PR does not have any gene associations for MONDO:0015281 'atrial standstill'.

@twhetzel I don't follow...
This PR shows that the gene annotations (2 genes) are removed from atrial standstill 1 (MONDO:0007171).
This should not happen.
Based on the notes from November: "I also started this spreadsheet to keep track of the record that should be excluded from the pipeline because they do not fit the rules."
To me, it means that nothing should happen to this term, i.e., no gene removed, no gene added.

Collaborator Author

I went back through all the commits from the previous PR. The last two commits you made, which added the excluded_from_qc_check annotations, were not originally showing up for me, which made this a very confusing picture. Now that I am aware of those later commits, I hope that clears things up 🙏

Collaborator Author

@twhetzel twhetzel Jan 30, 2025

Ok, I will manually add these back and @joeflack4 will update the OMIM ingest code to "protect" these so that nothing happens to this term, ie no gene removed, no gene added.
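One way the "protect" behaviour could work is sketched below. This is an assumption about a possible approach, not the planned OMIM ingest change: `is_protected` and `apply_gene_update` are hypothetical names, and stanzas are treated as plain lists of OBO lines.

```python
# The annotation curators add to terms reviewed as having multiple gene associations.
QC_EXCLUSION = (
    "relationship: excluded_from_qc_check "
    "http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql"
)


def is_protected(stanza_lines):
    """True if the term carries the qc-multiple-gene-associations exclusion."""
    return any(line.strip() == QC_EXCLUSION for line in stanza_lines)


def apply_gene_update(original_lines, proposed_lines):
    """Return the original stanza untouched when it is protected, else the update."""
    return list(original_lines) if is_protected(original_lines) else list(proposed_lines)


original = [
    QC_EXCLUSION,
    'relationship: has_material_basis_in_germline_mutation_in '
    'http://identifiers.org/hgnc/10593 {source="OMIM:108770"} ! SCN5A',
    'relationship: has_material_basis_in_germline_mutation_in '
    'http://identifiers.org/hgnc/4279 {source="MONDO:mim2gene_medgen", source="OMIM:108770"} ! GJA5',
]
proposed = [QC_EXCLUSION]  # what an update that strips both genes would produce
result = apply_gene_update(original, proposed)
```

With this check, `result` equals `original`, so no gene is removed and no gene is added for the protected term.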

Collaborator

@sabrinatoro sabrinatoro left a comment

Please see comments.

relationship: excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/10593 {source="OMIM:108770"} ! SCN5A
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/4279 {source="MONDO:mim2gene_medgen", source="OMIM:108770"} ! GJA5
Collaborator

This term has the annotation
excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql

These gene annotations should not be removed.

relationship: excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/29090 {source="MONDO:mim2gene_medgen", source="OMIM:158901"} ! SMCHD1
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/50800 {source="OMIM:158901"} ! DUX4
Collaborator

This term has the annotation
excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql

These gene annotations should not be removed.

@@ -500399,8 +500392,6 @@ xref: OMIM:619478 {source="MONDO:equivalentTo"}
xref: UMLS:C5561960 {source="MEDGEN:1794170", source="MONDO:equivalentTo", source="MONDO:MEDGEN"}
is_a: MONDO:0001347 {source="DOID:0060918", source="OMIM:619478"} ! facioscapulohumeral muscular dystrophy
relationship: excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql
-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/2979 {source="OMIM:619478"} ! DNMT3B
-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/50800 {source="OMIM:619478"} ! DUX4
Collaborator

This term has the annotation
excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql

These gene annotations should not be removed.

Member

So the presence of http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql necessarily means that this is a disease which is allowed to have multiple genes as causes. This will make our pipelines a little more complex.

Collaborator Author

Yes, the http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql exclusion means a curator reviewed this term and agreed that the disease has more than one disease-defining gene association.
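
To make the rule above concrete, here is a minimal Python sketch of how a pipeline step could detect the curator-reviewed exclusion on a term stanza before deciding whether gene associations may be pruned. The function names and the pruning rule are hypothetical and are not taken from the actual Mondo or OMIM pipeline code; only the relationship names and the SPARQL check IRI come from the diff in this PR.

```python
import re

# IRI of the QC check that curators exclude reviewed multi-gene terms from.
QC_MULTI_GENE = (
    "http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/"
    "qc-multiple-gene-associations.sparql"
)
GENE_REL = "has_material_basis_in_germline_mutation_in"

# Captures the numeric HGNC ID from the gene IRI in a relationship line.
HGNC_RE = re.compile(r"http://identifiers\.org/hgnc/(\d+)")


def allows_multiple_genes(stanza_lines):
    """True if the stanza carries the multi-gene QC exclusion."""
    wanted = f"relationship: excluded_from_qc_check {QC_MULTI_GENE}"
    return any(line.strip().startswith(wanted) for line in stanza_lines)


def gene_ids(stanza_lines):
    """HGNC numeric IDs from the stanza's germline-mutation relationships."""
    ids = []
    for line in stanza_lines:
        if line.startswith(f"relationship: {GENE_REL} "):
            match = HGNC_RE.search(line)
            if match:
                ids.append(match.group(1))
    return ids


def may_prune_genes(stanza_lines):
    """Hypothetical rule: multi-gene terms may be pruned only if not excluded."""
    return len(gene_ids(stanza_lines)) > 1 and not allows_multiple_genes(stanza_lines)


# Lines copied from the OMIM:619151 hunk in this PR.
stanza = [
    f"relationship: excluded_from_qc_check {QC_MULTI_GENE}",
    'relationship: has_material_basis_in_germline_mutation_in '
    'http://identifiers.org/hgnc/253 {source="OMIM:619151"} ! ADH5',
    'relationship: has_material_basis_in_germline_mutation_in '
    'http://identifiers.org/hgnc/404 {source="OMIM:619151"} ! ALDH2',
]
```

With the exclusion line present, `may_prune_genes(stanza)` is false, so both gene associations would be left in place. A real implementation would more likely work on a parsed ontology than on raw lines.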

@@ -503207,8 +503198,6 @@ xref: Orphanet:611216 {source="MONDO:equivalentTo"}
xref: UMLS:C5436906 {source="MONDO:equivalentTo", source="MONDO:MEDGEN", source="MEDGEN:1754257"}
is_a: MONDO:0000159 {source="OMIM:619151"} ! bone marrow failure syndrome
relationship: excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/253 {source="OMIM:619151"} ! ADH5
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/404 {source="OMIM:619151"} ! ALDH2
Collaborator

This term has the annotation
excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql

These gene annotations should not be removed.

@@ -505014,8 +505003,6 @@ xref: OMIM:620040 {source="MONDO:equivalentTo"}
xref: UMLS:C5774217 {source="MEDGEN:1823990", source="MONDO:equivalentTo", source="MONDO:MEDGEN"}
is_a: MONDO:0015780 {source="OMIM:620040"} ! dyskeratosis congenita
relationship: excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/12441 {source="OMIM:620040"} ! TYMS
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/30365 {source="OMIM:620040"} ! ENOSF1
Collaborator

This term has the annotation
excluded_from_qc_check http://purl.obolibrary.org/obo/mondo/sparql/qc/mondo/qc-multiple-gene-associations.sparql

These gene annotations should not be removed.

src/ontology/mondo-edit.obo Outdated
@@ -175530,6 +175521,7 @@ is_a: MONDO:0005138 {source="DOID:5409", source="EFO:0000702", source="MONDO:Red
is_a: MONDO:0005454 {source="MONDO:Redundant", source="NCIT:C4917/inferred", source="ONCOTREE:SCLC"} ! lung neuroendocrine neoplasm
intersection_of: MONDO:0000402 ! small cell carcinoma
intersection_of: disease_has_location UBERON:0002048 ! lung
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/9884 {source="OMIM:182280"} ! RB1
Collaborator

According to OMIM, this term has a % in front of its name, which means "Phenotype description or locus, molecular basis unknown".
[Screenshot 2025-01-30 at 8:29:15 AM]

Please confirm that this annotation should be added according to our rules and because it is in the file we used.
If it is, then the workflow works correctly, and I will remove it manually (as it is an error).

Collaborator

@joeflack4 joeflack4 Jan 30, 2025

Yes, for the logic we set up, we allow addition of these associations regardless of the MIM type (#, %, etc).

If you want, it is simple for me to add logic such that we only allow additions of associations if the disease MIM is of type # (Phenotype).
I think this would be easy and better than handling these manually each time.

FYI regarding how unexpected types are currently handled:
We do, however, "flag" cases where the MIM type is unexpected. We expect the MIM type to be # (Phenotype) or % (Phenotype description or locus, molecular basis unknown). By "flag", I mean that if an association meets all of the conditions for disease-defining but its MIM type isn't # or %, an entry is added to review.tsv in the release (documentation). Note that there have been no instances of a non-# or non-% entry being added, though.
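
As a rough illustration of the two behaviours discussed in this comment, the Python sketch below triages candidate associations by MIM type. The `Candidate` class, the `triage` function, and the `phenotype_only` switch are hypothetical names, not the OMIM repo's actual code. The sketch also assumes that a flagged entry is still added in the current mode, which the comment does not state explicitly.

```python
from dataclasses import dataclass

# MIM type symbols the current logic expects: "#" and "%".
EXPECTED_PREFIXES = {"#", "%"}


@dataclass
class Candidate:
    """An association that already met the disease-defining conditions."""
    mim: str         # e.g. "OMIM:182280"
    mim_prefix: str  # symbol in front of the OMIM entry name, e.g. "#" or "%"
    hgnc_id: str     # e.g. "HGNC:9884"


def triage(candidates, phenotype_only=False):
    """Split candidates into (to_add, to_review).

    phenotype_only=False: add every candidate, whatever its MIM type.
    phenotype_only=True:  proposed stricter rule, add only "#" entries.
    In both modes, entries whose type is neither "#" nor "%" are flagged
    for review (assumption: they would be written to review.tsv).
    """
    to_add, to_review = [], []
    for candidate in candidates:
        if candidate.mim_prefix not in EXPECTED_PREFIXES:
            to_review.append(candidate)
        if not phenotype_only or candidate.mim_prefix == "#":
            to_add.append(candidate)
    return to_add, to_review


# The RB1 association questioned above: OMIM:182280 carries a "%" prefix.
rb1 = Candidate(mim="OMIM:182280", mim_prefix="%", hgnc_id="HGNC:9884")
added_now, _ = triage([rb1])
added_strict, _ = triage([rb1], phenotype_only=True)
```

Under the current rule the RB1 association lands in `added_now`, while under the stricter rule `added_strict` is empty, which matches what manual removal would otherwise have to achieve.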

@@ -165662,6 +165652,7 @@ xref: SCTID:721307000 {source="MONDO:equivalentTo"}
xref: UMLS:C1834582 {source="MEDGEN:331782", source="MONDO:equivalentTo", source="MONDO:MEDGEN"}
is_a: MONDO:0020076 {source="DOID:0060888", source="Orphanet:420611"} ! myeloproliferative neoplasm
relationship: disease_arises_from_feature MONDO:0008608 ! Down syndrome
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/4170 {source="OMIM:159595"} ! GATA1
Collaborator

@sabrinatoro sabrinatoro Jan 30, 2025

Same comment as below.
If this annotation is expected as per our rules and the file we use, then all is well.
It should, however, be removed manually, as it is not correct according to the website.

Collaborator Author

@twhetzel twhetzel Jan 30, 2025

OK, I see this is for MONDO:0008040 'transient myeloproliferative syndrome', and the OMIM page shows it with a mapping key of 2. @joeflack4 mentioned that it has 2 rows in the morbidmap.txt file and was found to be self-referential and somatic. For these reasons it was added to the review.tsv file and needed curator review to decide whether the gene association should be added.

@twhetzel twhetzel assigned sabrinatoro and unassigned twhetzel Jan 30, 2025
Collaborator

@sabrinatoro sabrinatoro left a comment

It looks good! I am approving this PR.

@twhetzel, feel free to merge, but first, please see my comment regarding the SOP for annotations that should not be updated. Thank you.

@@ -225590,6 +225591,7 @@ xref: OMIM:278850 {source="MONDO:equivalentTo"}
xref: Orphanet:393 {source="OMIM:278850"}
xref: UMLS:C2749215 {source="MONDO:equivalentTo", source="MONDO:MEDGEN", source="MEDGEN:411414"}
is_a: MONDO:0100249 {source="DOID:0111763", source="Orphanet:393/btnt"} ! 46,XX testicular disorder of sex development
relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/11204 {source="OMIM:278850"} ! SOX9
Collaborator

We should not have this relationship to the gene because
"(...) is caused by heterozygous duplication or triplication of a 68-kb regulatory region (XXSR) -584 to -516 kb upstream of the SOX9 gene on chromosome 17q24."

However, one would not be able to see this without reading the details (i.e., this has nothing to do with this workflow).

I will manually remove it in a follow-up PR (I have something else to update).

@twhetzel Could you please confirm: Should I add this Mondo/OMIM to the "excluded spreadsheet", so it will not be updated next time? What is the SOP? Thanks.

Collaborator Author

Generally, there will be another file that can be updated by curators for gene associations that should not be added even though they otherwise meet the rules. Joe is working on this now.
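
Since that file does not exist yet, the following Python sketch only illustrates how a curator-maintained exclusion list might be applied to associations that otherwise meet the rules. The TSV column names (`omim_id`, `hgnc_id`, `reason`), the ID formats, and both function names are assumptions for illustration; the real file layout and the standard for exclusion reasons are still to be defined.

```python
import csv
import io


def load_exclusions(tsv_text):
    """Map (omim_id, hgnc_id) to the curator's exclusion reason.

    Assumes a tab-separated file with a header row containing the
    hypothetical columns omim_id, hgnc_id and reason.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(row["omim_id"], row["hgnc_id"]): row["reason"] for row in reader}


def apply_exclusions(associations, exclusions):
    """Split (omim_id, hgnc_id) pairs into kept pairs and dropped pairs.

    Dropped pairs carry the exclusion reason so it can be reported.
    """
    kept, dropped = [], []
    for omim_id, hgnc_id in associations:
        reason = exclusions.get((omim_id, hgnc_id))
        if reason is None:
            kept.append((omim_id, hgnc_id))
        else:
            dropped.append((omim_id, hgnc_id, reason))
    return kept, dropped


# Example entry based on the SOX9 case discussed above.
exclusion_tsv = (
    "omim_id\thgnc_id\treason\n"
    "OMIM:278850\tHGNC:11204\tcaused by duplication of a regulatory region "
    "upstream of SOX9, not by a mutation in the gene\n"
)
exclusions = load_exclusions(exclusion_tsv)
kept, dropped = apply_exclusions(
    [("OMIM:278850", "HGNC:11204"), ("OMIM:619151", "HGNC:253")],
    exclusions,
)
```

Keying on the OMIM/gene pair rather than on the disease alone would let curators block one gene for a disease without blocking others, but that is a design choice for the SOP to settle.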

@sabrinatoro sabrinatoro merged commit 20ed165 into master Jan 30, 2025
1 check passed
@sabrinatoro sabrinatoro deleted the omim_gene_references-bdecdc6 branch January 30, 2025 23:57
sabrinatoro added a commit to monarch-initiative/omim that referenced this pull request Jan 31, 2025
This update is based on [this PR](monarch-initiative/mondo#8624)

NOTE: SOP is needed surrounding this exclusion document.
(In addition, we need to create a standard for the exclusion reason).