Add harmonization of disease terms with MONDO IDs#126

Draft
theferrit32 wants to merge 23 commits into main from kf/122-disease-ids
Conversation

@theferrit32
Contributor

  • MONDO_LINKING task is enqueued after paper metadata extraction and patient variant occurrence extraction.

  • handle_mondo_linking() snapshots the paper disease name plus occurrence disease names, builds paper/occurrence context, trims blanks, and dedupes disease strings so each distinct name is resolved once.

  • Each disease string goes through find_mondo_term_for_disease_with_agent().

  • First pass is deterministic lookup against the in-memory MONDO index: direct MONDO ID, primary label, exact synonym, xref/external ID, related synonym, broad/narrow synonym, abbreviation, deprecated replacement.

  • Deterministic lookup only returns a match when a step obtains a unique match to one MONDO ID. Ambiguous exact matches are recorded as context, not guessed.

  • If deterministic lookup succeeds, the agent is skipped and the term is returned immediately.

  • If deterministic lookup does not select a term, retrieve_mondo_candidates() builds a RapidFuzz-ranked candidate list from MONDO labels/synonyms.

  • The MONDO agent receives the disease text, paper/occurrence context, strict ambiguity context, and initial candidates. It can also call tools to search candidates again, inspect a term, check parents/children, or look up xrefs.

  • The agent either selects a valid MONDO ID or returns null. Selected IDs are verified locally before being saved.
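The trim/dedupe step in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: `dedupe_disease_names`, `resolve_once`, and the `resolver` callable are hypothetical names, and deduping on the exact trimmed string (rather than case-insensitively) is an assumption.

```python
from typing import Callable, Iterable, Optional


def dedupe_disease_names(names: Iterable[Optional[str]]) -> list[str]:
    """Trim whitespace, drop blanks, and keep the first occurrence of each name."""
    seen: dict[str, None] = {}  # dicts preserve insertion order
    for name in names:
        cleaned = (name or "").strip()
        if cleaned:
            seen.setdefault(cleaned, None)
    return list(seen)


def resolve_once(
    names: Iterable[Optional[str]],
    resolver: Callable[[str], Optional[str]],
) -> dict[str, Optional[str]]:
    """Call the (hypothetical) resolver exactly once per distinct disease string."""
    return {name: resolver(name) for name in dedupe_disease_names(names)}
```

The returned mapping can then be applied back to the paper and to every occurrence that shares a disease string, so repeated names do not trigger repeated lookups or agent runs.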

@theferrit32
Contributor Author

The code goes through two main phases. First, it checks whether the disease term extracted from the paper has an exact match in MONDO, iterating through primary_label, exact_synonym, xref, related_synonym, broad_narrow_synonym, abbreviation, and deprecated_replacement, in that order, and exiting early if a unique match is found.

If there's no exact match, it then collects fuzzy matches using rapidfuzz and provides those to an agent to judge. The agent also has access to tool functions to do further navigation/searches if it wants.
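A rough sketch of the first phase, assuming the in-memory index maps each step name to a dict from normalized text to a set of MONDO IDs. That index layout, the `casefold` normalization, and the name `deterministic_lookup` are assumptions for illustration; only the step order comes from the description above.

```python
from typing import Any, Optional

# Step order as listed in this comment.
STEP_ORDER = [
    "primary_label",
    "exact_synonym",
    "xref",
    "related_synonym",
    "broad_narrow_synonym",
    "abbreviation",
    "deprecated_replacement",
]


def deterministic_lookup(
    index: dict[str, dict[str, set[str]]],
    disease_text: str,
) -> tuple[Optional[str], list[dict[str, Any]]]:
    """Return (mondo_id, ambiguities); mondo_id is None if no step is unique."""
    key = disease_text.strip().casefold()
    ambiguities: list[dict[str, Any]] = []
    for step in STEP_ORDER:
        matches = index.get(step, {}).get(key, set())
        if len(matches) == 1:
            # Unique match: exit early with the single MONDO ID.
            return next(iter(matches)), ambiguities
        if len(matches) > 1:
            # Ambiguous exact match: record it as context, do not guess.
            ambiguities.append({"step": step, "mondo_ids": sorted(matches)})
    return None, ambiguities
```

When the function returns `None`, the recorded ambiguities can be handed to the agent alongside the fuzzy candidates.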

@theferrit32
Contributor Author

In the event an exact match is not found and the agent takes over, it will persist reasoning context in the DB. The attached file is an example.
pmid-35712613-gene-TRPM7.mondo-linking-paper.json

Comment thread lib/core/environment.py
SQLLITE_DIR: str = 'sqllite'
EXTRACTED_PDF_DIR: str = 'extracted_pdfs'
REFERENCE_DATA_DIR: str = 'reference_data'
MONDO_ONTOLOGY_URL: str = 'https://purl.obolibrary.org/obo/mondo.json'
Collaborator

Will this change depending on environment? Maybe just put it in the mondo as a constant?

Also, the link does not curl for me, is there a chance they're blocking bots?

Contributor Author

I doubt that URL would change, but there may be a use case where you want to point it at a static set of reference data in a bucket or similar, rather than downloading from the obolibrary server.

That URL is their public-facing one, and it redirects to the actual location, which right now is on GitHub.

This should work (and requests.get follows redirects by default):

curl -L -O https://purl.obolibrary.org/obo/mondo.json

Comment thread lib/agents/mondo_linking_agent.py Outdated
Comment thread lib/agents/mondo_linking_agent.py Outdated
Comment thread lib/tasks/misc.py Outdated
Comment thread lib/tasks/handlers.py

def ontology_url() -> str:
    """Return the configured MONDO ontology download URL."""
    return getattr(env, 'MONDO_ONTOLOGY_URL', MONDO_ONTOLOGY_ENDPOINT)
Collaborator

This block looks weird; we can just use env.MONDO_ONTOLOGY_URL, I think (looking at you, Claude!)
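A minimal sketch of that simplification, assuming the settings object declares MONDO_ONTOLOGY_URL with a default, as the lib/core/environment.py snippet earlier in this thread suggests. `Env` here is a hypothetical stand-in for the project's settings class, not its actual definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Env:
    """Hypothetical stand-in for the project's environment settings object."""

    MONDO_ONTOLOGY_URL: str = 'https://purl.obolibrary.org/obo/mondo.json'


env = Env()


def ontology_url() -> str:
    """Return the configured MONDO ontology download URL."""
    # The attribute always exists because it is declared with a default,
    # so a getattr() fallback to a second constant is not needed.
    return env.MONDO_ONTOLOGY_URL
```

Pointing the loader at a static copy in a bucket would then only require overriding the one setting.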

match_context: dict[str, Any] | None = None


class MondoDiseaseContext(BaseModel):
Collaborator

I think we can likely do without this; the identifiers and other metadata are already present in the DB. If we're providing additional context to the Agent during matching, we want to give the full paper.

Comment thread lib/reference_data/mondo.py Outdated
Comment thread lib/tasks/handlers.py Outdated

setup_logging()
logger = logging.getLogger(__name__)
_find_mondo_term_for_disease = find_mondo_term_for_disease_with_agent
Collaborator

this line looks off! I've tended to put the Runner.run calls in this file, but the agent definitions elsewhere. Maybe lifting the "if deterministic match: return early else: call agent" logic into this file is appropriate?

Contributor Author

Ah I think this specific line is a holdover from earlier development when I had different strategies in other modules, and wanted to be able to switch between them.

find_mondo_term_for_disease_with_agent does both the initial exact match searching and the agent adjudication of ambiguous/fuzzy matches. I could lift that logic into handle_mondo_linking here, having it call into the exact match function, and then execute the agent with the ambiguous/fuzzy matches. I'll see how much that bloats handlers.py.
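The lift described here might look roughly like the following. It is only a sketch: `link_disease` and its three callables are hypothetical names standing in for the exact-match function, the candidate retrieval, and the agent run, and the real signatures in the PR may differ.

```python
from typing import Any, Callable, Optional


def link_disease(
    disease_text: str,
    exact_lookup: Callable[[str], tuple[Optional[str], list[dict[str, Any]]]],
    retrieve_candidates: Callable[[str], list[Any]],
    run_agent: Callable[[str, list[Any], list[dict[str, Any]]], Optional[str]],
) -> Optional[str]:
    """Return a MONDO ID, preferring a deterministic match over the agent."""
    mondo_id, ambiguities = exact_lookup(disease_text)
    if mondo_id is not None:
        # Deterministic match: return early and skip the agent entirely.
        return mondo_id
    # No unique exact match: give the agent fuzzy candidates plus ambiguity context.
    candidates = retrieve_candidates(disease_text)
    return run_agent(disease_text, candidates, ambiguities)
```

Passing the three steps in as callables keeps the handler short and makes the early-return path easy to test without running an agent.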

Contributor Author

It's not too bad

Contributor Author

But the file is getting pretty long in general

Comment thread lib/reference_data/mondo.py Outdated
"""Return an exact MONDO match or ambiguity context from in-memory indexes."""
strict_ambiguities: list[dict[str, Any]] = []

direct_matches = [
Collaborator

Rather than using a regex for this, I'd recommend having the agent try to do the deterministic mapping.

"If there is a MONDO ID embedded in the extracted disease string, call the get_mondo_term tool directly"

Collaborator

We're generally having much better success with "hey agent normalize this string and try it multiple times", than "write me a complicated parser to extract a piece of text" even if the latter is theoretically deterministic.

Comment thread lib/reference_data/mondo.py Outdated
return keys


def normalize_identifier_keys(identifier: str) -> set[str]:
Collaborator

Some thoughts on this function and higher level design:

  • We should probably only care about CURIEs in the index, so naming this url_to_curie and not keeping URLs seems cleaner. I think we don't want to deal with URLs coming out of papers in any capacity, so just supporting lookups based on identifiers seems better.
  • Having a secondary data structure that points non-MONDO CURIEs to a MONDO ID (which contains the hierarchy and disease names) seems reasonable, perhaps with a different Agent tool?
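Both suggestions can be sketched together. This is an illustration under assumptions: it handles only OBO PURLs of the form .../obo/PREFIX_LOCALID, splits on the first underscore, and `build_mondo_ids_by_xref` is a hypothetical name for the secondary structure.

```python
import re
from collections import defaultdict
from typing import Optional

OBO_PURL_PREFIXES = (
    "http://purl.obolibrary.org/obo/",
    "https://purl.obolibrary.org/obo/",
)
# Loose CURIE shape (prefix:local_id); an assumption, not a full CURIE grammar.
CURIE_PATTERN = re.compile(r"[A-Za-z][\w.]*:[\w.\-]+")


def url_to_curie(identifier: str) -> Optional[str]:
    """Convert an OBO PURL to a CURIE; pass CURIEs through; reject anything else."""
    text = identifier.strip()
    for prefix in OBO_PURL_PREFIXES:
        if text.startswith(prefix):
            namespace, sep, local_id = text[len(prefix):].partition("_")
            return f"{namespace}:{local_id}" if sep and local_id else None
    return text if CURIE_PATTERN.fullmatch(text) else None


def build_mondo_ids_by_xref(
    xrefs_by_mondo_id: dict[str, list[str]],
) -> dict[str, set[str]]:
    """Invert MONDO ID -> xrefs into xref CURIE -> MONDO IDs."""
    mondo_ids_by_xref: dict[str, set[str]] = defaultdict(set)
    for mondo_id, xrefs in xrefs_by_mondo_id.items():
        for xref in xrefs:
            mondo_ids_by_xref[xref].add(mondo_id)
    return dict(mondo_ids_by_xref)
```

Keeping a set per xref leaves room for an external identifier that maps to more than one MONDO term, which an agent tool could surface as an ambiguity instead of picking one.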

@bpblanken
Collaborator

one more high level thought, what do you think about something like:

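  from dataclasses import dataclass, field
  from typing import Any

  from rapidfuzz import fuzz, process

  @dataclass(frozen=True)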
  class MondoRecord:
      """A disease term from the MONDO ontology."""

      # Canonical MONDO identifier (e.g., "MONDO:0007947")
      mondo_id: str

      # Primary disease name (e.g., "Marfan syndrome")
      label: str

      # Textual definition of the disease
      definition: str | None = None

      # All disease names: primary label + synonyms + abbreviations
      # Examples: ["Marfan syndrome", "Marfan's syndrome", "MFS", "Marfan syndrome type 1"]
      # Used for fuzzy matching against raw disease text from papers
      aliases: list[str] = field(default_factory=list)


  @dataclass
  class MondoIndex:
      """In-memory index of the MONDO ontology for disease matching."""

      # Core data: MONDO ID → disease term with metadata
      records: dict[str, MondoRecord]

      # External identifier mappings (OMIM, Orphanet, DOID, UMLS, etc.)
      # Example: "MONDO:0007947" → ["OMIM:154700", "Orphanet:558", "DOID:14323"]
      # Used by agent tool: get_mondo_terms_by_xref()
      xrefs_by_mondo_id: dict[str, list[str]]

      # Ontology hierarchy: MONDO ID → parent terms (more general diseases)
      # Used by agent tool: get_mondo_parents()
      parent_edges_by_mondo_id: dict[str, list[dict[str, Any]]]

      # Ontology hierarchy: MONDO ID → child terms (more specific diseases)
      # Used by agent tool: get_mondo_children()
      child_edges_by_mondo_id: dict[str, list[dict[str, Any]]]


  def retrieve_mondo_candidates(
      index: MondoIndex,
      query: str,
      limit: int = 20,
  ) -> list[MondoCandidate]:
      """Return RapidFuzz-ranked candidates for a disease name query.

      Fuzzy matches against both disease name aliases (labels, synonyms) and
      external identifiers (OMIM, Orphanet, etc.). Exact identifier matches
      naturally rank highest.

      Args:
          index: The MONDO ontology index
          query: Raw disease text from paper (e.g., "Marfan syndrome" or "OMIM:154700")
          limit: Maximum number of candidates to return

      Returns:
          List of MondoCandidate ranked by relevance, most relevant first
      """
      all_terms = []
      term_to_mondo_id: dict[str, str] = {}

      # Collect disease name aliases (labels + synonyms)
      for record in index.records.values():
          for alias in record.aliases:
              all_terms.append(alias)
              term_to_mondo_id[alias] = record.mondo_id

      # Collect external identifiers
      for mondo_id, xrefs in index.xrefs_by_mondo_id.items():
          for xref in xrefs:
              all_terms.append(xref)
              term_to_mondo_id[xref] = mondo_id

      # Fuzzy rank all terms by similarity to query
      ranked = process.extract(query, all_terms, scorer=fuzz.token_sort_ratio, limit=limit)

      # Build candidates, deduping by MONDO ID
      candidates = []
      seen_mondo_ids = set()
      for term, score, _ in ranked:
          mondo_id = term_to_mondo_id[term]
          if mondo_id in seen_mondo_ids:
              continue
          seen_mondo_ids.add(mondo_id)

          record = index.records[mondo_id]
          candidates.append(MondoCandidate(
              mondo_id=record.mondo_id,
              label=record.label,
              matched_alias_text=term,
              retrieval_source='fuzzy',
              rapidfuzz_score=float(score),
              definition=record.definition,
          ))

      return candidates

basically fuzzy matching on any of the synonyms or xref identifiers?
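One detail a version of this would probably need to handle: a single term-to-ID dict keeps only the last MONDO ID when two records share an alias (abbreviations are a likely case). A possible way around that, sketched with hypothetical names and plain dicts in place of MondoIndex, is to map each term to a list of IDs:

```python
from collections import defaultdict


def build_term_lookup(
    aliases_by_mondo_id: dict[str, list[str]],
    xrefs_by_mondo_id: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Map every alias or xref string to all MONDO IDs that carry it."""
    term_to_mondo_ids: dict[str, list[str]] = defaultdict(list)
    for source in (aliases_by_mondo_id, xrefs_by_mondo_id):
        for mondo_id, terms in source.items():
            for term in terms:
                # Keep every owner of the term, in first-seen order, without duplicates.
                if mondo_id not in term_to_mondo_ids[term]:
                    term_to_mondo_ids[term].append(mondo_id)
    return dict(term_to_mondo_ids)
```

The fuzzy-ranking loop could then emit one candidate per MONDO ID for each matched term, so the agent sees every term that shares the alias.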

@theferrit32
Contributor Author

Pushing some more commits related to feedback scoped to this code, and a little more simplification.

But I am going to see if Codex can take this branch state and do a deep refactor, to see how much the agent itself can do with tools rather than having so much of it explicit in code.
