Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address qc-duplicate-exact-synonym-no-abbrev failures in mondo #751

Merged
merged 10 commits into from
Feb 5, 2025

Conversation

joeflack4
Copy link
Contributor

@joeflack4 joeflack4 commented Jan 15, 2025

Partially addresses issues in:

Overview

Filters cases where there are multiple Mondo terms with the same exactSynonym, and instead puts them into a special curation TSV.

Pre-merge checklist

Documentation

Was the documentation added/updated under docs/?

  • Yes
  • No, updates to the docs were not necessary after careful consideration

QC

Was the full pipeline run before submitting this PR using sh run.sh make build-mondo-ingest on this branch (after
docker pull obolibrary/odkfull:dev), and no errors occurred?

  • Yes
  • No, there are no functional (code-related) changes to the pipeline in the PR, so no re-run is necessary

Mini build:

Build:

New Packages

Were any new Python packages added?

Were any other non-Python packages added?

PR Review and Conversations Resolved

Has the PR been sufficiently reviewed by at least 1 team member of the Mondo Technical team and all threads resolved?

  • Yes

Additional info

Google sheet

Mondo PR using this branch:

- Filters cases where there are multiple Mondo terms with the same exactSynonym, and instead puts them into a special curation TSV.
- Add: reports/review-qc-duplicate-exact-synonym-no-abbrev.tsv
- Add/Update: Make goals & Python scripting to filter files & create that TSV.
- Update: Handle Mondo -unconfirmed cases too: Cases where the synonym sync finds no trace of the synonym in the source, but since we have not yet
- Update: Removed -confirmed deduping. This wasn't the correct way to handle checking against synonyms that already exist in Mondo, and -confirmed does not introduce duplication anyway. It doesn't add synonyms; only adds evidence for existing ones.
- Update: Handling cases now where a new exactSynonym coming in through the sync is equivalent to a label that exists on another Mondo term.
- Bug fix: Not all abbreviations were being filtered out
- Bug fix: Rows were showing up that were not supposed to, as a result of insufficient abbreviation information; only some rows for the given synonym were being filtered out, which resulted in some cases where there was only 1 row for the synonym showing in the sheet, or rows where the only cases that were showing up were -confirmed or -unconfirmed, which didn't make sense.
- Bug fix: Filtering obsoletes from check. It was being overzealous. It is allowable to have a conflict with an obsolete term. This was also manifsting in an issue where the only entries for a synonym in review.tsv were of -confirmed and -unconfirmed.
@joeflack4 joeflack4 mentioned this pull request Jan 21, 2025
9 tasks
Copy link
Contributor

@twhetzel twhetzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I am not entirely convinced an additional script was needed vs. incorporating this as a requirement for the synonym sync script, I am approving since the data generated in monarch-initiative/mondo#8584 looks generally as expected and the mondo-edit.obo file passes all QC steps. Nice job getting these items all sorted!

@joeflack4
Copy link
Contributor Author

Thanks!

If there's an idea of how to easily refactor this, let me know. It needs to come after the normal synonym sync, because the synonym sync, like the other pipeline, runs on each source individually. But in order to resolve this issue, we need to look at all sources in aggregate, hence the separate script.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants