Add harmonization of disease terms with MONDO IDs by theferrit32 · Pull Request #126 · clingen-data-model/cur-ai-ss

theferrit32 · 2026-06-03T16:12:13Z

MONDO_LINKING task is enqueued after paper metadata extraction and patient variant occurrence extraction.
handle_mondo_linking() snapshots the paper disease name plus occurrence disease names, builds paper/occurrence context, trims blanks, and dedupes disease strings so each distinct name is resolved once.
Each disease string goes through find_mondo_term_for_disease_with_agent().
First pass is deterministic lookup against the in-memory MONDO index: direct MONDO ID, primary label, exact synonym, xref/external ID, related synonym, broad/narrow synonym, abbreviation, deprecated replacement.
Deterministic lookup only returns a match when a step obtains a unique match to one MONDO ID. Ambiguous exact matches are recorded as context, not guessed.
If deterministic lookup succeeds, the agent is skipped and the term is returned immediately.
If deterministic lookup does not select a term, retrieve_mondo_candidates() builds a RapidFuzz-ranked candidate list from MONDO labels/synonyms.
The MONDO agent receives the disease text, paper/occurrence context, strict ambiguity context, and initial candidates. It can also call tools to search candidates again, inspect a term, check parents/children, or look up xrefs.
The agent either selects a valid MONDO ID or returns null. Selected IDs are verified locally before being saved.
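
To make the "unique match or record ambiguity" rule above concrete, here is a minimal sketch. It is not the PR's implementation: the index layout (step name to normalized text to a set of MONDO IDs), `LookupResult`, `normalize`, and `unique_disease_names` are all hypothetical names chosen for illustration, and the direct-MONDO-ID step is left out for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Step order follows the description above; the index layout is an assumption.
STEP_ORDER = [
    "primary_label",
    "exact_synonym",
    "xref",
    "related_synonym",
    "broad_narrow_synonym",
    "abbreviation",
    "deprecated_replacement",
]


@dataclass
class LookupResult:
    mondo_id: str | None = None
    match_type: str | None = None
    # Steps that matched more than one MONDO ID, kept as context for the agent.
    ambiguities: list[dict] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace (hypothetical normalization)."""
    return " ".join(text.lower().split())


def unique_disease_names(names: list[str | None]) -> list[str]:
    """Trim, drop blanks, and dedupe while preserving first-seen order."""
    seen: dict[str, None] = {}
    for name in names:
        cleaned = (name or "").strip()
        if cleaned:
            seen.setdefault(cleaned, None)
    return list(seen)


def deterministic_lookup(
    text: str, index: dict[str, dict[str, set[str]]]
) -> LookupResult:
    """Walk the steps in order and stop at the first unique match."""
    result = LookupResult()
    key = normalize(text)
    for step in STEP_ORDER:
        matches = index.get(step, {}).get(key, set())
        if len(matches) == 1:
            result.mondo_id = next(iter(matches))
            result.match_type = step
            return result
        if len(matches) > 1:
            # Ambiguous: record it, do not guess, and keep going.
            result.ambiguities.append({"step": step, "mondo_ids": sorted(matches)})
    return result
```

A caller would check `result.mondo_id` first and only hand `result.ambiguities` to the agent when it is `None`.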

theferrit32 · 2026-06-03T16:33:49Z

The code goes through 2 main phases. It first tries to see if the disease term extracted from the paper has an exact match in MONDO. It iterates through primary_label, exact_synonym, xref, related_synonym, broad_narrow_synonym, abbreviation, and deprecated_replacement, in that order, exiting early if a unique match is found.

If there's no exact match, it then collects fuzzy matches using rapidfuzz and provides those to an agent to judge. The agent also has access to tool functions to do further navigation/searches if it wants.

theferrit32 · 2026-06-03T16:35:12Z

In the event an exact match is not found and the agent takes over, it will persist reasoning context in the DB. The attached file is an example.
pmid-35712613-gene-TRPM7.mondo-linking-paper.json

bpblanken · 2026-06-03T16:40:51Z

    SQLLITE_DIR: str = 'sqllite'
    EXTRACTED_PDF_DIR: str = 'extracted_pdfs'
    REFERENCE_DATA_DIR: str = 'reference_data'
+    MONDO_ONTOLOGY_URL: str = 'https://purl.obolibrary.org/obo/mondo.json'


Will this change depending on environment? Maybe just put it in the mondo module as a constant?

Also, the link does not curl for me, is there a chance they're blocking bots?

I doubt that URL would change but I guess there may be a use case where you want to direct it to a static set of reference data in a bucket or something, rather than downloading from the obolibrary server.

That URL is their public-facing one, and it redirects to the actual location, which right now is on GitHub.

This should work (and requests.get has redirect-following enabled by default)

curl -L -O https://purl.obolibrary.org/obo/mondo.json

bpblanken · 2026-06-03T17:07:13Z

+
+def ontology_url() -> str:
+    """Return the configured MONDO ontology download URL."""
+    return getattr(env, 'MONDO_ONTOLOGY_URL', MONDO_ONTOLOGY_ENDPOINT)


this block looks weird, we can just use env.MONDO_ONTOLOGY_URL I think (looking at you Claude!)
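
As a rough illustration of why the `getattr` fallback is redundant: if the default lives on the settings object itself, plain attribute access always succeeds, and an environment variable can still point at a static copy (for example in a bucket). The `Env` class below is a hypothetical stand-in built on `dataclasses` and `os.environ`; the project's real settings class may work differently.

```python
import os
from dataclasses import dataclass, field

DEFAULT_MONDO_URL = "https://purl.obolibrary.org/obo/mondo.json"


@dataclass(frozen=True)
class Env:
    """Hypothetical stand-in for the project's settings object."""

    # The default is resolved when the settings object is created, so an
    # environment variable of the same name overrides the public URL.
    MONDO_ONTOLOGY_URL: str = field(
        default_factory=lambda: os.environ.get(
            "MONDO_ONTOLOGY_URL", DEFAULT_MONDO_URL
        )
    )


def ontology_url(env: Env) -> str:
    """Return the configured MONDO ontology download URL."""
    # The attribute always exists, so no getattr() fallback is needed.
    return env.MONDO_ONTOLOGY_URL
```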

bpblanken · 2026-06-03T17:14:19Z

+    match_context: dict[str, Any] | None = None
+
+
+class MondoDiseaseContext(BaseModel):


I think we can likely do without this; the identifiers and other metadata are already present in the DB. If we're providing additional context to the agent during matching, we want to give it the full paper.

bpblanken · 2026-06-03T18:04:27Z


 setup_logging()
 logger = logging.getLogger(__name__)
+_find_mondo_term_for_disease = find_mondo_term_for_disease_with_agent


this line looks off! I've tended to put the Runner.run calls in this file, but the agent definitions elsewhere. Maybe lifting the "if deterministic match: return early else: call agent" logic into this file is appropriate?

Ah I think this specific line is a holdover from earlier development when I had different strategies in other modules, and wanted to be able to switch between them.

find_mondo_term_for_disease_with_agent does both the initial exact match searching and the agent adjudication of ambiguous/fuzzy matches. I could lift that logic into handle_mondo_linking here, having it call into the exact match function, and then execute the agent with the ambiguous/fuzzy matches. I'll see how much that bloats handlers.py.

It's not too bad

But the file is getting pretty long in general
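
For reference, the "if deterministic match: return early, else: call agent" shape being discussed could look roughly like the sketch below. The function name `link_disease` and its parameters are hypothetical, and the two matchers are passed in as plain synchronous callables purely for illustration; the real agent call goes through `Runner.run` and may well be asynchronous.

```python
from typing import Callable, Optional


def link_disease(
    disease_text: str,
    exact_match: Callable[[str], Optional[str]],
    run_agent: Callable[[str], Optional[str]],
    known_ids: set[str],
) -> Optional[str]:
    """Exact match first; otherwise defer to the agent and verify its answer."""
    text = disease_text.strip()
    if not text:
        return None  # blank disease strings are skipped

    mondo_id = exact_match(text)
    if mondo_id is not None:
        return mondo_id  # deterministic hit: the agent is never called

    selected = run_agent(text)
    # Only accept an agent-selected ID that exists in the local index.
    return selected if selected in known_ids else None
```

Keeping the two matchers as injected callables means the handler only owns the control flow, which is one way to limit how much it grows.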

bpblanken · 2026-06-03T19:00:52Z

+    """Return an exact MONDO match or ambiguity context from in-memory indexes."""
+    strict_ambiguities: list[dict[str, Any]] = []
+
+    direct_matches = [


Rather than using a regex for this, I'd recommend having the agent try to do the deterministic mapping.

"If there is a mondo id embedded in the extracted disease string, call the "get_mondo_term" tool directly"

We're generally having much better success with "hey agent normalize this string and try it multiple times", than "write me a complicated parser to extract a piece of text" even if the latter is theoretically deterministic.

bpblanken · 2026-06-03T19:36:16Z

+    return keys
+
+
+def normalize_identifier_keys(identifier: str) -> set[str]:


Some thoughts on this function and higher level design:

We should probably only care about CURIEs in the index, so naming this url_to_curie and not keeping URLs seems cleaner. I think we don't want to deal with URLs coming out of papers in any capacity, so just supporting lookups based on identifiers seems better.

Having a secondary data structure that points non-MONDO CURIEs to a MONDO ID (which contains the hierarchy and disease names) seems reasonable, perhaps with a different agent tool?
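
A minimal sketch of that secondary structure, assuming xrefs arrive as a MONDO-ID-to-list mapping. `normalize_curie`, `build_xref_index`, and `mondo_ids_for_xref` are hypothetical names, and upper-casing the prefix is just one possible normalization. Anything URL-shaped is rejected rather than converted.

```python
from __future__ import annotations

from collections import defaultdict


def normalize_curie(identifier: str) -> str | None:
    """Return 'PREFIX:local' with an upper-cased prefix, or None if not CURIE-shaped."""
    prefix, sep, local = identifier.strip().partition(":")
    # URLs (anything containing a slash) are deliberately not supported.
    if not sep or not prefix or not local or "/" in identifier:
        return None
    return f"{prefix.upper()}:{local.strip()}"


def build_xref_index(xrefs_by_mondo_id: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each non-MONDO CURIE to the set of MONDO IDs that reference it."""
    index: dict[str, set[str]] = defaultdict(set)
    for mondo_id, xrefs in xrefs_by_mondo_id.items():
        for xref in xrefs:
            curie = normalize_curie(xref)
            if curie and not curie.startswith("MONDO:"):
                index[curie].add(mondo_id)
    return dict(index)


def mondo_ids_for_xref(index: dict[str, set[str]], identifier: str) -> list[str]:
    """Look up MONDO IDs for an external CURIE; empty list if unknown or malformed."""
    curie = normalize_curie(identifier)
    return sorted(index.get(curie, set())) if curie else []
```

Returning a list rather than a single ID leaves room for an external identifier that maps to more than one MONDO term.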

bpblanken · 2026-06-03T20:00:59Z

one more high level thought, what do you think about something like:

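  from dataclasses import dataclass, field
  from typing import Any

  from rapidfuzz import fuzz, process


  @dataclass(frozen=True)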
  class MondoRecord:
      """A disease term from the MONDO ontology."""

      # Canonical MONDO identifier (e.g., "MONDO:0007947")
      mondo_id: str

      # Primary disease name (e.g., "Marfan syndrome")
      label: str

      # Textual definition of the disease
      definition: str | None = None

      # All disease names: primary label + synonyms + abbreviations
      # Examples: ["Marfan syndrome", "Marfan's syndrome", "MFS", "Marfan syndrome type 1"]
      # Used for fuzzy matching against raw disease text from papers
      aliases: list[str] = field(default_factory=list)


  @dataclass
  class MondoIndex:
      """In-memory index of the MONDO ontology for disease matching."""

      # Core data: MONDO ID → disease term with metadata
      records: dict[str, MondoRecord]

      # External identifier mappings (OMIM, Orphanet, DOID, UMLS, etc.)
      # Example: "MONDO:0007947" → ["OMIM:154700", "Orphanet:558", "DOID:14323"]
      # Used by agent tool: get_mondo_terms_by_xref()
      xrefs_by_mondo_id: dict[str, list[str]]

      # Ontology hierarchy: MONDO ID → parent terms (more general diseases)
      # Used by agent tool: get_mondo_parents()
      parent_edges_by_mondo_id: dict[str, list[dict[str, Any]]]

      # Ontology hierarchy: MONDO ID → child terms (more specific diseases)
      # Used by agent tool: get_mondo_children()
      child_edges_by_mondo_id: dict[str, list[dict[str, Any]]]


  def retrieve_mondo_candidates(
      index: MondoIndex,
      query: str,
      limit: int = 20,
  ) -> list[MondoCandidate]:
      """Return RapidFuzz-ranked candidates for a disease name query.

      Fuzzy matches against both disease name aliases (labels, synonyms) and
      external identifiers (OMIM, Orphanet, etc.). Exact identifier matches
      naturally rank highest.

      Args:
          index: The MONDO ontology index
          query: Raw disease text from paper (e.g., "Marfan syndrome" or "OMIM:154700")
          limit: Maximum number of candidates to return

      Returns:
          List of MondoCandidate ranked by relevance, most relevant first
      """
      all_terms = []
      term_to_mondo_id: dict[str, str] = {}

      # Collect disease name aliases (labels + synonyms)
      for record in index.records.values():
          for alias in record.aliases:
              all_terms.append(alias)
              term_to_mondo_id[alias] = record.mondo_id

      # Collect external identifiers
      for mondo_id, xrefs in index.xrefs_by_mondo_id.items():
          for xref in xrefs:
              all_terms.append(xref)
              term_to_mondo_id[xref] = mondo_id

      # Fuzzy rank all terms by similarity to query
      ranked = process.extract(query, all_terms, scorer=fuzz.token_sort_ratio, limit=limit)

      # Build candidates, deduping by MONDO ID
      candidates = []
      seen_mondo_ids = set()
      for term, score, _ in ranked:
          mondo_id = term_to_mondo_id[term]
          if mondo_id in seen_mondo_ids:
              continue
          seen_mondo_ids.add(mondo_id)

          record = index.records[mondo_id]
          candidates.append(MondoCandidate(
              mondo_id=record.mondo_id,
              label=record.label,
              matched_alias_text=term,
              retrieval_source='fuzzy',
              rapidfuzz_score=float(score),
              definition=record.definition,
          ))

      return candidates

basically fuzzy matching on any of the synonyms or xref identifiers?
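
One detail worth noting in the sketch above: because `limit` is applied before deduping by MONDO ID, several aliases of the same term can crowd out other candidates, and a single `term_to_mondo_id` entry cannot represent an alias shared by two terms. A possible variation, shown here without RapidFuzz so it stays self-contained, keeps the best score per MONDO ID and truncates afterwards. `best_per_mondo_id` and its inputs are hypothetical; in practice the scored pairs would come from the fuzzy ranking step.

```python
def best_per_mondo_id(
    scored: list[tuple[str, float]],
    term_to_mondo_ids: dict[str, list[str]],
    limit: int,
) -> list[tuple[str, str, float]]:
    """Keep the highest-scoring matched term per MONDO ID, then truncate."""
    best: dict[str, tuple[str, float]] = {}
    for term, score in scored:
        # A term (e.g. an abbreviation) may belong to several MONDO IDs.
        for mondo_id in term_to_mondo_ids.get(term, ()):
            current = best.get(mondo_id)
            if current is None or score > current[1]:
                best[mondo_id] = (term, score)

    # Highest score first; ties broken by MONDO ID for stable output.
    ranked = sorted(best.items(), key=lambda item: (-item[1][1], item[0]))
    return [(mondo_id, term, score) for mondo_id, (term, score) in ranked[:limit]]
```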

theferrit32 · 2026-06-04T22:59:57Z

Pushing some more commits related to feedback scoped to this code, and a little more simplification.

But I am going to see if codex can take this branch state and then do a deep refactor to see how much the agent itself can do with tools, rather than having so much of it explicit in code.

theferrit32 added 12 commits June 3, 2026 10:47

Initial pass at MONDO harmonization skeleton

d7a0d2c

Implement MONDO disease matching

946d3a9

Add docstrings describing matching steps

aa1436f

Add fuzzy score cutoff, skip blanks, ensure other ids are not interpreted as MONDO

7044acb

Add docstrings

fe07983

Add SQLite-backed MONDO linking save point

a98bf5f

Revert back to the in-memory MONDO matching

1da4f67

Remove MONDO SQLite save-point artifacts

42fb2c4

Trim MONDO PR artifacts

ee6af2b

Clear stale MONDO fields on disease edits

d2a9fb3

Share MONDO RapidFuzz candidate ranking

15e5f8f

Remove unused MONDO non-agent matcher

b6bb98c

fmt

95cf5fa

bpblanken reviewed Jun 3, 2026

Comment thread lib/tasks/handlers.py

Merge remote-tracking branch 'origin/main' into kf/122-disease-ids

5a3e31d

bpblanken reviewed Jun 3, 2026

Fix migration chain and combine the mondo ones

f78d45f

bpblanken reviewed Jun 3, 2026

Comment thread lib/reference_data/mondo.py Outdated

theferrit32 added 2 commits June 3, 2026 13:26

Remove unnecessary confidence_score from mondo linking agent

6549393

Remove unnecessary 'matched_text' field from mondo linking agent. It'll always just be the input text from the paper

ae152ca

bpblanken reviewed Jun 3, 2026

theferrit32 added 2 commits June 3, 2026 16:34

Scope MONDO linking tasks to occurrences

b65a446

Refactor hardcoded match type string into MondoMatchType

e19ea85

theferrit32 added 4 commits June 3, 2026 17:59

Refactor and separate MONDO exact and agent-based linking and call from handle_mondo_linking

91a063a

WIP: Simplify some lookup and indexing. Update naming and docstrings.

faa7d61

Remove fuzzy_sort_key and related logic. We will trust the rapidfuzz ranking and take the highest per mondo id.

7d71cdb

Add build_mondo_agent_message and delete match_context

4f2e866
