Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synonym Sync: qc-duplicate-exact-synonym-no-abbrev related updates && Remove unused files #767

Draft
wants to merge 20 commits into
base: develop
Choose a base branch
from

Conversation

joeflack4
Copy link
Contributor

@joeflack4 joeflack4 commented Feb 7, 2025

Overview

Further updates to scripting that prevents multiple exact synonyms from appearing on different Mondo IDs.

Most important change:

  • Now no longer filters any such conflicts from -updated templates.

Other changes:

  • Now doesn't mutate inputs. Inputs/outputs are different files.
  • Handled edge cases where if there were no conflicts of a certain type, would get error.
  • Docs for reports/sync-synonym/review-qc-duplicate-exact-synonym-no-abbrev.tsv into reports/README.md

Pre-merge checklist

Documentation

Was the documentation added/updated under docs/?

  • Yes
  • No, updates to the docs were not necessary after careful consideration

QC

Was the full pipeline run before submitting this PR using sh run.sh make build-mondo-ingest on this branch (after
docker pull obolibrary/odkfull:dev), and no errors occurred?

  • Yes
  • No, there are no functional (code-related) changes to the pipeline in the PR, so no re-run is necessary

Mini builds:

Builds:

New Packages

Were any new Python packages added?

Were any other non-Python packages added?

PR Review and Conversations Resolved

Has the PR been sufficiently reviewed by at least 1 team member of the Mondo Technical team and all threads resolved?

  • Yes

Additional notes

- Remove the combined cases file
- Update the source-specific files so that they now go in tmp/
- Further updates to scripting that prevents multiple exact synonyms from appearing on different Mondo IDs.
- Bug fix: synonyms-scope-type-xref(.sparql/.tsv) was missing cls_labe. This was causing a KeyError. Not sure how the code was running successfully without this before.
- Update: No longer filtering -updated.
@joeflack4 joeflack4 requested a review from twhetzel February 7, 2025 02:25
@joeflack4 joeflack4 changed the base branch from main to develop February 7, 2025 02:25
@joeflack4 joeflack4 self-assigned this Feb 7, 2025
@joeflack4 joeflack4 added the enhancement New feature or request label Feb 7, 2025
@joeflack4 joeflack4 marked this pull request as draft February 7, 2025 03:08
- Update: Now has different paths for input/output files, instead of mutating the inputs.
- Delete: Code for backing up the inputs. No longer needed, as there are no longer mutations.
- Add: Docs for reports/sync-synonym/review-qc-duplicate-exact-synonym-no-abbrev.tsv
- Update: Finished updating parameterization regrarding change of input/output dirs
  - Note that this update also gets fixes the spurious / confusing diffs issue
- Update: Added logic to handle errors in case of no issues detected
@joeflack4 joeflack4 marked this pull request as ready for review February 8, 2025 02:57
@joeflack4 joeflack4 changed the title qc-duplicate-exact-synonym-no-abbrev related upates qc-duplicate-exact-synonym-no-abbrev related updates Feb 8, 2025
twhetzel

This comment was marked as outdated.

@joeflack4 joeflack4 marked this pull request as draft February 9, 2025 00:26
@joeflack4 joeflack4 marked this pull request as ready for review February 9, 2025 00:27
@joeflack4 joeflack4 changed the title qc-duplicate-exact-synonym-no-abbrev related updates Synonym Sync: qc-duplicate-exact-synonym-no-abbrev related updates && Remove unused files Feb 9, 2025
Copy link
Contributor

@twhetzel twhetzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments inline


### 7. `reports/sync-synonym/review-qc-duplicate-exact-synonym-no-abbrev.tsv`
**What this file represents**
This file shows cases that were filtered out of the synonym sync because they caused conflicts, identified by qc-duplicate-exact-synonym-no-abbrev.sparql.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This report file should contain any synonym generated by the Synonym Sync pipeline that is considered an "-added" synonym that if the synonym were added to Mondo would result in a Mondo QC check error due to the qc-duplicate-exact-synonym-no-abbrev.sparql check and the current Mondo synonym the conflict is with. Are only the synonyms in the "-added" file being "filtered out of the synonym sync" and in this report file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twhetzel You're correct. I did make that change, but when creating the documentation, I copy/pasted from the spreadsheet.

Just updated this and pushed a new commit. New text is:

This file shows cases -added that were filtered out of the synonym sync because they caused conflicts, identified by qc-duplicate-exact-synonym-no-abbrev.sparql in the mondo repo.

**Columns**
- `synonym`
- `mondo_id`: The Mondo term that is getting affected by an -added or -updated synonym change. If the 'case' for the row
is -confirmed or -unconfirmed, then this synonym already exists in that Mondo term.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file and overall code changes should only be applied for "-added" synonym sync cases (when compared to "-confirmed"). The "-updated" cases should not be considered here at this time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to remove -updated!

- `mondo_id`: The Mondo term that is getting affected by an -added or -updated synonym change. If the 'case' for the row
is -confirmed or -unconfirmed, then this synonym already exists in that Mondo term.
- `source_id`: The source term ID that the synonym is coming from. In the case that a synonym appears in multiple
sources, there will be multiple rows.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file format can simply be the same as what is in the QC check failure for now:

MONDO:0009151-MONDO:0700251,oboInOwl:hasExactSynonym-oboInOwl:hasExactSynonym,Zlotogora-Ogur syndrome
MONDO:0700251-MONDO:0009151,oboInOwl:hasExactSynonym-oboInOwl:hasExactSynonym,Zlotogora-Ogur syndrome

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think let's not make this change. I think this will not be quick, and I'm also not confident that this is a better format.

The current format is very informative and more granular, I think, and better displays cases where the synonym appears in more than 2 places. E.g. can see in the previous version of this file (spreadsheet), at the top, that the synonym 2-hydroxyglutaric aciduria has 5 rows, 2 that will be filtered out (-added), as well as 4 rows where the conflicts occur (confirmed/unconfirmed).

| 3C syndrome | MONDO:0019078 | GARD:0005666 | confirmed |
| 3C syndrome | MONDO:0019078 | Orphanet:7 | confirmed |

The conflict here is in the first of these rows. The synonym sync wants to update the synonym scope on MONDO:0009073, as evidenced by OMIM:220210. However, in changing it to exactSynonym, there would be a conflict, because that synonym already exists on MONDO:0019078, where it is evidenced by GARD:0005666 and Orphanet:7.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the code should only be checking between "-added" and "-confirmed", there should be no reports from updated at this time.

Copy link
Contributor Author

@joeflack4 joeflack4 Feb 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll update the example now!

I changed the case example to 'added and:

The conflict here is in the first of these rows. The synonym sync wants to add this synonym to MONDO:0009073, as evidenced by OMIM:220210. However this introduces a conflict because that synonym already exists on MONDO:0019078, where it is evidenced by GARD:0005666 and Orphanet:7.

--outpath-confirmed $(SYN_SYNC_DIR)/$*.synonyms.confirmed.robot.tsv \
--outpath-updated $(SYN_SYNC_DIR)/$*.synonyms.updated.robot.tsv \
--outpath-combined $(TMPDIR)/synonym_sync_combined_cases_$*.tsv
--outpath-added $(TMPDIR)/$*.synonyms.added.robot.tsv \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on #773, the "added" file contains row 1 starting with mondo_id and on row 2 starting with ID. However, when I open this file in Excel to check the data, the only column with data and a row 2 column header (ie ROBOT column header) is ID. It looks like the tabs are not correct in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. This is big, thanks for the catch. I'm not sure what is up with these files being incorrect. I'll look into it now.

Copy link
Contributor Author

@joeflack4 joeflack4 Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug fixes

So, I have fixed a few bugs here. But let me set up some context.

This is a follow-up of:

Somehow, builds ran off of that, we reviewed it, and yet several issues slipped in:

These things are now fixed, and there is a new mini-build:

I think this PR is doing what it needs to now. But... (see below)

Problems validating the results via diffs

It's important to note that the mini build shows the changes in these files from after #751 until now. But since #751 was bugged, I don't know how valuable this diff will be. It would also be nice to look at a diff between the state of develop before #751 and compare it to this PR, but because of columnar differences, I don't think the diff will be helpful.

Sigh This whole matter is further complicated by the fact that I was running mini builds for this PR and seeing confusing diffs. In the process, I discovered and fixed a sorting error:

If we want to see the results of this PR in diff form, we should merge #777 first into develop, then I can merge it into here.

But that's not all. I'm still seeing issues of indeterministic output with the synonym sync, introduced months ago by these case variation columns. The synonym_case_diff_mondo column sometimes gets populated correctly, and other times does not. Can observe in the screenshots below.

Screenshots

These screenshots are DiffMerge. This is the result of me running the synonym sync back-to-back with no code changes, but observing diff in the outputs.

diff1 diff2

I didn't see a quick fix. Maybe there is one. But if not, it might be a better use of time to fix it effectively via:

There are several mini builds linked in the OP (that I ran at various stages), but I closed them as I'm not sure how useful the diffs currently are.

I have another normal build for this running; I will let it finish and push.

--outpath-updated $(SYN_SYNC_DIR)/$*.synonyms.updated.robot.tsv \
--outpath-combined $(TMPDIR)/synonym_sync_combined_cases_$*.tsv
--outpath-added $(TMPDIR)/$*.synonyms.added.robot.tsv \
--outpath-confirmed $(TMPDIR)/$*.synonyms.confirmed.robot.tsv \
Copy link
Contributor

@twhetzel twhetzel Feb 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on #773, the "confirmed" file contains only one header row with these column headers:

mondo_id mondo_label synonym_scope_source synonym_scope_mondo synonym synonym_case_mondo synonym_case_diff_mondo synonym_case_mondo_is_many synonym_case_source synonym_case_diff_source synonym_case_source_is_many source_id source_label synonym_type synonym_type_mondo mondo_evidence case exact_synonym exact_source_id exact_synonym_type broad_synonym broad_source_id broad_synonym_type narrow_synonym narrow_source_id narrow_synonym_type related_synonym related_source_id related_synonym_type

Why are there some many changes in the file, is there still an issue with the file sorting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script doesn't (isn't supposed to) mutate -confirmed. There was a bug I just fixed where the robot subheader was being removed.

I'm going to explain everything related to the issues with confirmed, updated, and added in 1 comment above.

--outpath-combined $(TMPDIR)/synonym_sync_combined_cases_$*.tsv
--outpath-added $(TMPDIR)/$*.synonyms.added.robot.tsv \
--outpath-confirmed $(TMPDIR)/$*.synonyms.confirmed.robot.tsv \
--outpath-updated $(TMPDIR)/$*.synonyms.updated.robot.tsv
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on #773, the "updated" file that contains the scope-mismatch results, also only contains one row with these column headers:

synonym synonym_scope mondo_id source_id synonym_type case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't observe this with there only being 1 row with these column headers. What I did observe with the -updated file was the same issue you described with -added. Now fixed!

- Update: Docs to reflect recent changes
- Delete: Files that are now going to be in tmp/ instead.
@joeflack4
Copy link
Contributor Author

Putting in draft mode until addressing reviewing and addressing anything that comes up in:

@joeflack4 joeflack4 marked this pull request as draft February 11, 2025 03:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request stability
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants