Add harmonization of disease terms with MONDO IDs #126
The code goes through two main phases. It first checks whether the disease term extracted from the paper has an exact match in MONDO, iterating through several kinds of exact lookup. If there is no exact match, it collects fuzzy matches using rapidfuzz and provides them to an agent to judge. The agent also has access to tool functions for further navigation and searches if it wants.
If an exact match is not found and the agent takes over, it persists its reasoning context in the DB. The attached file is an example.
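As a rough illustration of the two phases described above, here is a minimal sketch. It is not the code from this PR: the index contents, the function name resolve_disease_term, and the cutoff are hypothetical, the MONDO IDs are placeholders, and the standard library's difflib stands in for rapidfuzz so the snippet runs without extra dependencies.

```python
from difflib import get_close_matches

# Hypothetical miniature index: normalized label -> MONDO ID.
# These IDs are placeholders for illustration, not real MONDO identifiers.
LABEL_INDEX = {
    "cystic fibrosis": "MONDO:9990001",
    "marfan syndrome": "MONDO:9990002",
    "marfanoid habitus": "MONDO:9990003",
}


def resolve_disease_term(text, index=LABEL_INDEX, cutoff=0.6, limit=5):
    """Return (mondo_id, candidates).

    Phase 1: exact lookup on the normalized text.
    Phase 2: if that fails, collect fuzzy candidates for an agent to judge.
    """
    # Normalize case and collapse runs of whitespace.
    key = " ".join(text.lower().split())
    if key in index:
        return index[key], []
    # difflib is used here in place of rapidfuzz (assumption for portability).
    labels = get_close_matches(key, list(index), n=limit, cutoff=cutoff)
    return None, [(label, index[label]) for label in labels]
```

In this sketch the second element of the result is what would be handed to the agent; the agent call itself is left out.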
| SQLLITE_DIR: str = 'sqllite' | ||
| EXTRACTED_PDF_DIR: str = 'extracted_pdfs' | ||
| REFERENCE_DATA_DIR: str = 'reference_data' | ||
| MONDO_ONTOLOGY_URL: str = 'https://purl.obolibrary.org/obo/mondo.json' |
Will this change depending on the environment? Maybe just put it in the mondo module as a constant?
Also, the link does not curl for me. Is there a chance they're blocking bots?
I doubt that URL would change, but there may be a use case where you want to point it at a static set of reference data in a bucket or similar, rather than downloading from the obolibrary server.
That URL is their public-facing one, and it redirects to the actual location, which right now is on GitHub.
This should work (and requests.get follows redirects by default):
curl -L -O https://purl.obolibrary.org/obo/mondo.json
| def ontology_url() -> str: | ||
| """Return the configured MONDO ontology download URL.""" | ||
| return getattr(env, 'MONDO_ONTOLOGY_URL', MONDO_ONTOLOGY_ENDPOINT) |
This block looks odd; I think we can just use env.MONDO_ONTOLOGY_URL (looking at you, Claude!)
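One way to read this suggestion, sketched under assumptions: if the settings object itself declares the default, the getattr fallback becomes unnecessary and plain attribute access is enough. The Env class below is hypothetical (the project's real env object may be built differently), as is the choice of reading an override from an environment variable of the same name.

```python
import os
from dataclasses import dataclass, field

DEFAULT_MONDO_URL = "https://purl.obolibrary.org/obo/mondo.json"


@dataclass(frozen=True)
class Env:
    """Hypothetical settings object; the real `env` in the project may differ."""

    # The default lives on the settings class, with an optional override
    # read from the process environment when the object is constructed.
    MONDO_ONTOLOGY_URL: str = field(
        default_factory=lambda: os.environ.get(
            "MONDO_ONTOLOGY_URL", DEFAULT_MONDO_URL
        )
    )


env = Env()


def ontology_url() -> str:
    """Return the configured MONDO ontology download URL."""
    # No getattr fallback needed: the attribute always exists.
    return env.MONDO_ONTOLOGY_URL
```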
| match_context: dict[str, Any] | None = None | ||
| class MondoDiseaseContext(BaseModel): |
I think we can likely do without this; the identifiers and other metadata are already present in the DB. If we're providing additional context to the Agent during matching, we want to give it the full paper.
…ll always just be the input text from the paper
| setup_logging() | ||
| logger = logging.getLogger(__name__) | ||
| _find_mondo_term_for_disease = find_mondo_term_for_disease_with_agent |
This line looks off! I've tended to put the Runner.run calls in this file, but the agent definitions elsewhere. Maybe lifting the "if deterministic match: return early, else: call agent" logic into this file is appropriate?
Ah, I think this specific line is a holdover from earlier development, when I had different strategies in other modules and wanted to be able to switch between them.
find_mondo_term_for_disease_with_agent does both the initial exact match searching and the agent adjudication of ambiguous/fuzzy matches. I could lift that logic into handle_mondo_linking here, having it call into the exact match function, and then execute the agent with the ambiguous/fuzzy matches. I'll see how much that bloats handlers.py.
It's not too bad
But the file is getting pretty long in general
| """Return an exact MONDO match or ambiguity context from in-memory indexes.""" | ||
| strict_ambiguities: list[dict[str, Any]] = [] | ||
| direct_matches = [ |
Rather than using a regex for this, I'd recommend having the agent try to do the deterministic mapping, with an instruction such as:
"If there is a MONDO ID embedded in the extracted disease string, call the 'get_mondo_term' tool directly"
We're generally having much better success with "hey agent, normalize this string and try it multiple times" than with "write me a complicated parser to extract a piece of text", even if the latter is theoretically deterministic.
| return keys | ||
| def normalize_identifier_keys(identifier: str) -> set[str]: |
Some thoughts on this function and higher-level design:
- We should probably only care about CURIEs in the index, so naming this url_to_curie and not keeping URLs seems cleaner. I think we don't want to deal with URLs coming out of papers in any capacity, so just supporting lookups based on identifiers seems better.
- Having a secondary data structure that points non-MONDO CURIEs to a MONDO ID (which contains the hierarchy and disease names) seems reasonable? Perhaps with a different Agent tool?
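To make the two suggestions concrete, here is a minimal sketch, not the PR's implementation. The name url_to_curie comes from the comment above; build_xref_index, the regular expressions, the upper-casing of prefixes for lookup, and the sample data shapes are all assumptions for illustration.

```python
import re
from typing import Dict, List, Optional, Set

# OBO PURLs have the shape http://purl.obolibrary.org/obo/PREFIX_LOCALID.
_OBO_PURL = re.compile(r"^https?://purl\.obolibrary\.org/obo/([A-Za-z]+)_(\d+)$")
# A loose CURIE shape: PREFIX:LOCAL_ID, where the local part has no
# whitespace and does not start with "/" (which rules out plain URLs).
_CURIE = re.compile(r"^([A-Za-z][A-Za-z0-9_.]*):([^\s/]\S*)$")


def url_to_curie(identifier: str) -> Optional[str]:
    """Normalize an OBO PURL or CURIE to PREFIX:LOCAL_ID, else return None."""
    text = identifier.strip()
    match = _OBO_PURL.match(text) or _CURIE.match(text)
    if not match:
        return None
    # Prefixes are upper-cased so lookups are case-insensitive (assumption).
    return f"{match.group(1).upper()}:{match.group(2)}"


def build_xref_index(xrefs_by_term: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each non-MONDO CURIE to the MONDO IDs that list it as an xref."""
    index: Dict[str, Set[str]] = {}
    for mondo_id, xrefs in xrefs_by_term.items():
        for xref in xrefs:
            curie = url_to_curie(xref)
            if curie and not curie.startswith("MONDO:"):
                index.setdefault(curie, set()).add(mondo_id)
    return index
```

Keeping a set of MONDO IDs per external CURIE means a one-to-many xref shows up as an ambiguity rather than being silently resolved to whichever term was indexed last.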
One more high-level thought: what do you think about something like fuzzy matching on any of the synonyms or xref identifiers?
…om handle_mondo_linking
…ranking and take the highest per mondo id.
Pushing some more commits related to feedback scoped to this code, plus a little more simplification. But I am going to see if codex can take this branch state and do a deep refactor, to see how much the agent itself can do with tools rather than having so much of it explicit in code.
The MONDO_LINKING task is enqueued after paper metadata extraction and patient variant occurrence extraction.
handle_mondo_linking() snapshots the paper disease name plus occurrence disease names, builds paper/occurrence context, trims blanks, and dedupes disease strings so each distinct name is resolved once.
Each disease string goes through find_mondo_term_for_disease_with_agent().
First pass is deterministic lookup against the in-memory MONDO index: direct MONDO ID, primary label, exact synonym, xref/external ID, related synonym, broad/narrow synonym, abbreviation, deprecated replacement.
Deterministic lookup only returns a match when a step obtains a unique match to one MONDO ID. Ambiguous exact matches are recorded as context, not guessed.
If deterministic lookup succeeds, the agent is skipped and the term is returned immediately.
If deterministic lookup does not select a term, retrieve_mondo_candidates() builds a RapidFuzz-ranked candidate list from MONDO labels/synonyms.
The MONDO agent receives the disease text, paper/occurrence context, strict ambiguity context, and initial candidates. It can also call tools to search candidates again, inspect a term, check parents/children, or look up xrefs.
The agent either selects a valid MONDO ID or returns null. Selected IDs are verified locally before being saved.
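The deduplication, unique-match, and local verification steps in the summary above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the helper names, the shape of the indexes argument, and the choice to keep checking later steps after an ambiguous one are all hypothetical.

```python
def dedupe_disease_names(names):
    """Strip whitespace, drop blanks, and keep each distinct name once, in order."""
    seen, result = set(), []
    for name in names:
        cleaned = (name or "").strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


def deterministic_lookup(text, indexes):
    """Walk lookup steps in priority order; accept only a unique MONDO ID.

    `indexes` is an ordered list of (step_name, {key: set_of_mondo_ids}).
    Returns (mondo_id_or_None, ambiguities).
    """
    key = text.casefold()
    ambiguities = []
    for step_name, index in indexes:
        hits = index.get(key, set())
        if len(hits) == 1:
            # A unique hit ends the search; the agent would be skipped.
            return next(iter(hits)), ambiguities
        if len(hits) > 1:
            # Record the ambiguity as context instead of guessing.
            ambiguities.append({"step": step_name, "candidates": sorted(hits)})
    return None, ambiguities


def verify_selection(mondo_id, known_ids):
    """Accept an agent-selected ID only if it exists in the local ontology."""
    return mondo_id if mondo_id in known_ids else None
```

When deterministic_lookup returns None, the recorded ambiguities are the kind of context that could be passed to the agent alongside the fuzzy candidates.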