Bug: Disease-Gene: missing HGNC links #186

Closed
joeflack4 opened this issue Jan 29, 2025 · 2 comments
Labels: bug, needs discussion, omim

Comments

@joeflack4

Overview

Recently, when running the disease-gene pipeline in Mondo, we noticed that many disease-gene source annotations were being removed unexpectedly.

Explanation by example

id: MONDO:0013576
-relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/5716 {source="MONDO:mim2gene_medgen", source="OMIM:614102"} ! IGKC
+relationship: has_material_basis_in_germline_mutation_in http://identifiers.org/hgnc/5716 {source="MONDO:mim2gene_medgen"} ! IGKC

The source OMIM:614102 was removed because the OMIM gene-disease pipeline did not find evidence for an association between OMIM:614102 and HGNC:5716 (IGKC). However, we were surprised, because the label for it is IGKC, and we observed that IGKC is visible in the "Gene/Locus" field of the "Phenotype-Gene Relationships" table on https://omim.org/entry/614102.

It is also visible in morbidmap.txt, the data file that represents all of these "Phenotype-Gene Relationships" tables:

| Phenotype | Gene/Locus And Other Related Symbols | MIM Number | Cyto Location |
| --- | --- | --- | --- |
| Kappa light chain deficiency, 614102 (3) | IGKC, IGKCD | 147200 | 2p11.2 |

So why the removal? Because the pipeline is not looking for HGNC symbols or IDs in morbidmap.txt. It's looking for them in mim2gene.txt. And there is a discrepancy: even if the HGNC symbol shows up for the association in morbidmap.txt, it does not always appear for the same association in mim2gene.txt:

| MIM Number | MIM Entry Type (see FAQ 1.3 at https://omim.org/help/faq) | Entrez Gene ID (NCBI) | Approved Gene Symbol (HGNC) | Ensembl Gene ID (Ensembl) |
| --- | --- | --- | --- | --- |
| 147200 | gene | 3514 | | ENSG00000211592 |
| 614102 | phenotype | | | |
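
To make the gap concrete, here is a minimal sketch that scans mim2gene.txt-style rows for `gene` entries with an empty approved-symbol column. It assumes the file is tab-separated with the five columns listed above and that the blank value in the 147200 row is the Approved Gene Symbol. The function name and the sample string are hypothetical, not part of the pipeline.

```python
# Sample rows transcribed from the issue; tab-separated, 5 columns (assumption).
MIM2GENE_SAMPLE = (
    "147200\tgene\t3514\t\tENSG00000211592\n"
    "614102\tphenotype\t\t\t\n"
)


def gene_mims_without_symbol(mim2gene_text):
    """Return MIM numbers of 'gene' entries whose approved-symbol column is empty."""
    missing = []
    for line in mim2gene_text.splitlines():
        # Skip blank lines and '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Pad short rows so unpacking always sees 5 columns.
        fields += [""] * (5 - len(fields))
        mim, entry_type, _entrez_id, symbol, _ensembl_id = fields[:5]
        if entry_type == "gene" and not symbol.strip():
            missing.append(mim)
    return missing


print(gene_mims_without_symbol(MIM2GENE_SAMPLE))
```

Any MIM number reported here is a candidate for the kind of dropped `source` annotation shown in the diff.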

Possible solutions

We will contact OMIM to see if there is some reason why morbidmap.txt and mim2gene.txt appear to be out of sync in this way. The best solution may involve a fix on their end.

Otherwise, we can fix this on our end in a number of ways:
a. At the end of the pipeline, run an additional SPARQL query to see if there are any associations which are missing HGNC evidence, and if so, we can add it.
b. We can do the same thing, but rather than SPARQL, do a check in Python right before adding the association.
c. Earlier in the Python pipeline, we can combine all of the HGNC associations from both morbidmap.txt and mim2gene.txt.

For any of these a-c solutions, we can utilize hgnc_complete_set.txt, which we are already downloading and using elsewhere in the pipeline. We can use the first two columns, hgnc_id (e.g. HGNC:5716), and symbol (e.g. IGKC), to map any symbols we see in morbidmap.txt to their HGNC IDs.
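
A rough sketch of that mapping step, assuming hgnc_complete_set.txt is tab-separated with a header row naming the `hgnc_id` and `symbol` columns, and that the gene field in morbidmap.txt is a comma-separated list of symbols. Both function names and the inline sample are hypothetical illustrations, not existing pipeline code.

```python
import csv
import io

# Two-column stand-in for hgnc_complete_set.txt (the real file has more columns).
HGNC_SAMPLE = "hgnc_id\tsymbol\nHGNC:5716\tIGKC\n"


def load_symbol_to_hgnc(handle):
    """Build a {symbol: hgnc_id} lookup from an HGNC complete-set style TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["symbol"]: row["hgnc_id"] for row in reader}


def map_morbidmap_symbols(symbols_field, symbol_to_hgnc):
    """Map each symbol in a morbidmap gene field to its HGNC ID, or None."""
    symbols = [s.strip() for s in symbols_field.split(",") if s.strip()]
    return {symbol: symbol_to_hgnc.get(symbol) for symbol in symbols}


lookup = load_symbol_to_hgnc(io.StringIO(HGNC_SAMPLE))
print(map_morbidmap_symbols("IGKC, IGKCD", lookup))
```

Symbols that map to `None` are not approved HGNC symbols in the lookup, so the caller decides whether to skip them or report them.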

@joeflack4

@twhetzel fyi

@sabrinatoro

(discussed with Trish on February 3rd, 2025)

Since there are a handful of genes that are not (or not yet) in HGNC, we will ignore these gene-disease associations.

Decision: if a gene is not in HGNC, ignore it (i.e., do not add this gene to Mondo).
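
One way this decision could look in code, as a sketch only: associations whose gene symbol has no HGNC ID in the lookup are dropped instead of being added. The function name, the tuple layout, and the placeholder symbol `NOT_IN_HGNC` are assumptions for illustration.

```python
def keep_hgnc_associations(associations, symbol_to_hgnc):
    """Keep only (disease_id, symbol) pairs whose symbol has an HGNC ID.

    Returns (disease_id, hgnc_id) pairs; genes absent from HGNC are ignored.
    """
    kept = []
    for disease_id, symbol in associations:
        hgnc_id = symbol_to_hgnc.get(symbol)
        if hgnc_id is None:
            # Gene is not (or not yet) in HGNC: skip the association.
            continue
        kept.append((disease_id, hgnc_id))
    return kept


associations = [("OMIM:614102", "IGKC"), ("OMIM:614102", "NOT_IN_HGNC")]
print(keep_hgnc_associations(associations, {"IGKC": "HGNC:5716"}))
```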
