Synonym Sync: qc-duplicate-exact-synonym-no-abbrev
related updates && Remove unused files
#767
base: develop
- Remove the combined cases file - Update the source-specific files so that they now go in tmp/
- Further updates to scripting that prevents multiple exact synonyms from appearing on different Mondo IDs. - Bug fix: synonyms-scope-type-xref(.sparql/.tsv) was missing cls_labe. This was causing a KeyError. Not sure how the code was running successfully without this before. - Update: No longer filtering -updated.
- Update: Now has different paths for input/output files, instead of mutating the inputs. - Delete: Code for backing up the inputs. No longer needed, as there are no longer mutations.
- Add: Docs for reports/sync-synonym/review-qc-duplicate-exact-synonym-no-abbrev.tsv
- Update: Finished updating parameterization regarding change of input/output dirs - Note that this update also fixes the spurious / confusing diffs issue - Update: Added logic to handle errors in case of no issues detected
- Bug fix: ImportError
- Delete: The by-source combined case files as well
Comments inline
src/ontology/reports/README.md
Outdated
### 7. `reports/sync-synonym/review-qc-duplicate-exact-synonym-no-abbrev.tsv`
**What this file represents**
This file shows cases that were filtered out of the synonym sync because they caused conflicts, identified by qc-duplicate-exact-synonym-no-abbrev.sparql.
This report file should contain any synonym generated by the Synonym Sync pipeline that is an "-added" synonym and that, if added to Mondo, would cause a Mondo QC check error from the qc-duplicate-exact-synonym-no-abbrev.sparql check, together with the current Mondo synonym it conflicts with. Are only the synonyms in the "-added" file being "filtered out of the synonym sync" and included in this report file?
@twhetzel You're correct. I did make that change, but when creating the documentation, I copy/pasted from the spreadsheet.
Just updated this and pushed a new commit. New text is:
This file shows cases `-added` that were filtered out of the synonym sync because they caused conflicts, identified by `qc-duplicate-exact-synonym-no-abbrev.sparql` in the `mondo` repo.
src/ontology/reports/README.md
Outdated
**Columns**
- `synonym`
- `mondo_id`: The Mondo term that is getting affected by an -added or -updated synonym change. If the 'case' for the row
  is -confirmed or -unconfirmed, then this synonym already exists in that Mondo term.
This file and overall code changes should only be applied for "-added" synonym sync cases (when compared to "-confirmed"). The "-updated" cases should not be considered here at this time.
Updated to remove `-updated`!
- `mondo_id`: The Mondo term that is getting affected by an -added or -updated synonym change. If the 'case' for the row
  is -confirmed or -unconfirmed, then this synonym already exists in that Mondo term.
- `source_id`: The source term ID that the synonym is coming from. In the case that a synonym appears in multiple
  sources, there will be multiple rows.
The file format can simply be the same as what is in the QC check failure for now:
MONDO:0009151-MONDO:0700251,oboInOwl:hasExactSynonym-oboInOwl:hasExactSynonym,Zlotogora-Ogur syndrome
MONDO:0700251-MONDO:0009151,oboInOwl:hasExactSynonym-oboInOwl:hasExactSynonym,Zlotogora-Ogur syndrome
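For illustration, the pairwise format quoted above could be generated from a synonym and the set of Mondo IDs that carry it. This is only a sketch: the function name `qc_style_lines` is hypothetical, and it assumes the QC output contains one line per ordered pair of IDs, as the two example lines suggest.

```python
from itertools import permutations


def qc_style_lines(synonym, mondo_ids, predicate="oboInOwl:hasExactSynonym"):
    """Build one line per ordered pair of Mondo IDs sharing the synonym.

    Hypothetical helper: the line layout is copied from the example lines
    in this thread, not from the QC check's actual implementation.
    """
    return [
        f"{a}-{b},{predicate}-{predicate},{synonym}"
        for a, b in permutations(sorted(mondo_ids), 2)
    ]


lines = qc_style_lines(
    "Zlotogora-Ogur syndrome", {"MONDO:0009151", "MONDO:0700251"}
)
for line in lines:
    print(line)
```

Under this assumption, a synonym shared by n Mondo IDs produces n * (n - 1) lines, and the source IDs are not represented.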
I think let's not make this change. I think this will not be quick, and I'm also not confident that this is a better format.
The current format is very informative and more granular, I think, and better displays cases where the synonym appears in more than 2 places. E.g. can see in the previous version of this file (spreadsheet), at the top, that the synonym 2-hydroxyglutaric aciduria has 5 rows, 2 that will be filtered out (`-added`), as well as 4 rows where the conflicts occur (confirmed/unconfirmed).
src/ontology/reports/README.md
Outdated
| 3C syndrome | MONDO:0019078 | GARD:0005666 | confirmed |
| 3C syndrome | MONDO:0019078 | Orphanet:7 | confirmed |

The conflict here is in the first of these rows. The synonym sync wants to update the synonym scope on MONDO:0009073, as evidenced by OMIM:220210. However, in changing it to exactSynonym, there would be a conflict, because that synonym already exists on MONDO:0019078, where it is evidenced by GARD:0005666 and Orphanet:7.
Since the code should only be checking between "-added" and "-confirmed", there should be no reports from updated at this time.
Ok, I'll update the example now!
I changed the case in the example to `added`, and the text now reads:
The conflict here is in the first of these rows. The synonym sync wants to add this synonym to MONDO:0009073, as evidenced by OMIM:220210. However this introduces a conflict because that synonym already exists on MONDO:0019078, where it is evidenced by GARD:0005666 and Orphanet:7.
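The conflict described in this thread can be sketched in a few lines. This is a minimal illustration and not the script from this PR: the function name `find_added_conflicts` is hypothetical, the row keys follow the report columns (`synonym`, `mondo_id`, `source_id`, `case`), and it assumes exact string matching and that only `added` rows are checked against `confirmed` rows.

```python
from collections import defaultdict


def find_added_conflicts(rows):
    """Return 'added' rows whose synonym is already confirmed on another Mondo ID."""
    # Collect the Mondo IDs on which each synonym already exists.
    existing = defaultdict(set)
    for row in rows:
        if row["case"] == "confirmed":
            existing[row["synonym"]].add(row["mondo_id"])

    conflicts = []
    for row in rows:
        if row["case"] != "added":
            continue
        # A conflict needs the same synonym on a *different* Mondo term.
        other_terms = existing.get(row["synonym"], set()) - {row["mondo_id"]}
        if other_terms:
            conflicts.append(row)
    return conflicts


rows = [
    {"synonym": "3C syndrome", "mondo_id": "MONDO:0009073",
     "source_id": "OMIM:220210", "case": "added"},
    {"synonym": "3C syndrome", "mondo_id": "MONDO:0019078",
     "source_id": "GARD:0005666", "case": "confirmed"},
    {"synonym": "3C syndrome", "mondo_id": "MONDO:0019078",
     "source_id": "Orphanet:7", "case": "confirmed"},
]
conflicts = find_added_conflicts(rows)
```

With the example rows from the README, only the `added` row on MONDO:0009073 is returned; the two `confirmed` rows are the evidence for the conflict and are not filtered.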
--outpath-confirmed $(SYN_SYNC_DIR)/$*.synonyms.confirmed.robot.tsv \
--outpath-updated $(SYN_SYNC_DIR)/$*.synonyms.updated.robot.tsv \
--outpath-combined $(TMPDIR)/synonym_sync_combined_cases_$*.tsv
--outpath-added $(TMPDIR)/$*.synonyms.added.robot.tsv \
Based on #773, the "added" file contains row 1 starting with `mondo_id` and row 2 starting with `ID`. However, when I open this file in Excel to check the data, the only column with data and a row 2 column header (i.e. ROBOT column header) is `ID`. It looks like the tabs are not correct in this file.
You are right. This is big, thanks for the catch. I'm not sure what is up with these files being incorrect. I'll look into it now.
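A quick structural check could catch this kind of problem before review. The sketch below is an assumption-laden illustration and not part of this PR: `check_template` is a hypothetical helper, and it assumes row 1 is the column header, row 2 is the ROBOT subheader, and everything after that is data.

```python
import csv
import io


def check_template(tsv_text):
    """Report rows with a wrong field count and columns with no data at all."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    if not rows:
        return {"misaligned_rows": [], "empty_columns": []}
    header = rows[0]
    # 1-based line numbers of rows whose field count differs from the header's.
    misaligned = [i for i, r in enumerate(rows, start=1) if len(r) != len(header)]
    # Data rows come after the header and the assumed ROBOT subheader.
    data = [r for r in rows[2:] if len(r) == len(header)]
    empty = [
        name
        for j, name in enumerate(header)
        if data and all(not r[j].strip() for r in data)
    ]
    return {"misaligned_rows": misaligned, "empty_columns": empty}


# Columns present but emptied out, as described for the -added file.
emptied = "mondo_id\tsynonym\tsource_id\nID\t\t\nMONDO:0009073\t\t\n"
# Tabs missing entirely below the header row.
untabbed = "mondo_id\tsynonym\nID\nMONDO:0009073\n"

emptied_report = check_template(emptied)
untabbed_report = check_template(untabbed)
```

An empty column is not always an error, since some columns may legitimately carry no values, so the result is a prompt for a closer look and not a pass/fail verdict.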
Bug fixes
So, I have fixed a few bugs here. But let me set up some context.
This is a follow-up of:
Somehow, builds ran off of that, we reviewed it, and yet several issues slipped in:
- Build breaking bug, now fixed: Bug fix: `ModuleNotFoundError` #772
- Build breaking bug, now fixed: Bug fix: `KeyError` on `mondo_label` #775
- `-confirmed` robot subheader was being removed
- `-updated` and `-added` were having contents removed from several columns
These things are now fixed, and there is a new mini-build:
I think this PR is doing what it needs to now. But... (see below)
Problems validating the results via diffs
It's important to note that the mini build shows the changes in these files from after #751 until now. But since #751 was bugged, I don't know how valuable this diff will be. It would also be nice to look at a diff between the state of develop before #751 and compare it to this PR, but because of columnar differences, I don't think the diff will be helpful.
*Sigh* This whole matter is further complicated by the fact that I was running mini builds for this PR and seeing confusing diffs. In the process, I discovered and fixed a sorting error:
If we want to see the results of this PR in diff form, we should merge #777 first into develop, then I can merge it into here.
But that's not all. I'm still seeing nondeterministic output from the synonym sync, introduced months ago by these case variation columns. The `synonym_case_diff_mondo` column sometimes gets populated correctly, and other times does not. This can be observed in the screenshots below.
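One way to separate ordering noise from real content differences between two runs is to sort on a total order before writing and to compare runs as multisets of rows. This is a sketch only, not the fix in #777: `sort_rows` and `content_diff` are hypothetical helpers, and the example values are made up.

```python
from collections import Counter


def sort_rows(rows, key_columns):
    """Sort on the key columns, then on every other column as a tie-breaker."""
    def sort_key(row):
        rest = sorted(c for c in row if c not in key_columns)
        return tuple(row[c] or "" for c in list(key_columns) + rest)
    return sorted(rows, key=sort_key)


def content_diff(rows_a, rows_b):
    """Rows found only in run A and only in run B, ignoring row order."""
    def freeze(rows):
        return Counter(tuple(sorted(r.items())) for r in rows)
    a, b = freeze(rows_a), freeze(rows_b)
    return list((a - b).elements()), list((b - a).elements())


run1 = [
    {"synonym": "b", "mondo_id": "MONDO:2", "synonym_case_diff_mondo": ""},
    {"synonym": "a", "mondo_id": "MONDO:1", "synonym_case_diff_mondo": "A"},
]
run2 = list(reversed(run1))                          # same content, other order
run3 = [dict(run1[0]), dict(run1[1], synonym_case_diff_mondo="")]  # one cell lost

same_after_sort = sort_rows(run1, ["synonym"]) == sort_rows(run2, ["synonym"])
only_in_1, only_in_3 = content_diff(run1, run3)
```

Sorting on all columns removes spurious diffs caused by ties in the key columns, while `content_diff` still surfaces a cell that is populated in one run and empty in the next.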
Screenshots
These screenshots are from DiffMerge. They show the result of running the synonym sync back-to-back with no code changes and still observing a diff in the outputs.
![diff1](https://private-user-images.githubusercontent.com/13045020/411443292-f8c6efcb-ee4e-45ff-9f52-c91bd6202d4d.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzOTQ0NDAsIm5iZiI6MTczOTM5NDE0MCwicGF0aCI6Ii8xMzA0NTAyMC80MTE0NDMyOTItZjhjNmVmY2ItZWU0ZS00NWZmLTlmNTItYzkxYmQ2MjAyZDRkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEyVDIxMDIyMFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWExOTEzOGMyNzZhOTIyMTVhOGQzZDMwMDFiZmU1MWU3ZThmZmI4YjYwYzI2OGExYWI4ZjdmYTYxNDUzN2UzYmUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.OQOEINaFiJ76e7OBRvo1nQ7tqYtbv3feYkqox_uFwBE)
![diff2](https://private-user-images.githubusercontent.com/13045020/411443297-6bdc36c0-36f0-45e8-a48f-be0aaa16ca1c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzOTQ0NDAsIm5iZiI6MTczOTM5NDE0MCwicGF0aCI6Ii8xMzA0NTAyMC80MTE0NDMyOTctNmJkYzM2YzAtMzZmMC00NWU4LWE0OGYtYmUwYWFhMTZjYTFjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEyVDIxMDIyMFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg2OWMwZmI3MTdkYjQzNDE0Yzg0MWNmZmIxNmY2YzdjMjViZjk1N2EwYzAzNWY4OWIzYzY0MjdmZWEyOTY5MWQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.YyVnTac8W_zkh5R9kpRFDI5s4cDbDbHUyQsejnp8p4U)
I didn't see a quick fix. Maybe there is one. But if not, it might be a better use of time to fix it effectively via:
There are several mini builds linked in the OP (that I ran at various stages), but I closed them as I'm not sure how useful the diffs currently are.
I have another normal build for this running; I will let it finish and push.
--outpath-updated $(SYN_SYNC_DIR)/$*.synonyms.updated.robot.tsv \
--outpath-combined $(TMPDIR)/synonym_sync_combined_cases_$*.tsv
--outpath-added $(TMPDIR)/$*.synonyms.added.robot.tsv \
--outpath-confirmed $(TMPDIR)/$*.synonyms.confirmed.robot.tsv \
Based on #773, the "confirmed" file contains only one header row with these column headers:
mondo_id mondo_label synonym_scope_source synonym_scope_mondo synonym synonym_case_mondo synonym_case_diff_mondo synonym_case_mondo_is_many synonym_case_source synonym_case_diff_source synonym_case_source_is_many source_id source_label synonym_type synonym_type_mondo mondo_evidence case exact_synonym exact_source_id exact_synonym_type broad_synonym broad_source_id broad_synonym_type narrow_synonym narrow_source_id narrow_synonym_type related_synonym related_source_id related_synonym_type
Why are there so many changes in the file? Is there still an issue with the file sorting?
This script doesn't (isn't supposed to) mutate `-confirmed`. There was a bug I just fixed where the robot subheader was being removed.
I'm going to explain everything related to the issues with confirmed, updated, and added in 1 comment above.
--outpath-combined $(TMPDIR)/synonym_sync_combined_cases_$*.tsv
--outpath-added $(TMPDIR)/$*.synonyms.added.robot.tsv \
--outpath-confirmed $(TMPDIR)/$*.synonyms.confirmed.robot.tsv \
--outpath-updated $(TMPDIR)/$*.synonyms.updated.robot.tsv
Based on #773, the "updated" file that contains the scope-mismatch results, also only contains one row with these column headers:
synonym synonym_scope mondo_id source_id synonym_type case
I didn't observe the `-updated` file having only 1 row with these column headers. What I did observe with the `-updated` file was the same issue you described with `-added`. Now fixed!
- Update: Docs to reflect recent changes
- Delete: Files that are now going to be in tmp/ instead.
- Bug fix: Columns were getting dropped
- Bug fix: Robot subheader missing from -confirmed template
Putting in draft mode until reviewing and addressing anything that comes up in:
Overview
Further updates to scripting that prevents multiple exact synonyms from appearing on different Mondo IDs.
Most important change:
`-updated` templates.
Other changes:
`reports/sync-synonym/review-qc-duplicate-exact-synonym-no-abbrev.tsv` into `reports/README.md`
Pre-merge checklist
Documentation
Was the documentation added/updated under `docs/`?
QC
Was the full pipeline run before submitting this PR using `sh run.sh make build-mondo-ingest` on this branch (after `docker pull obolibrary/odkfull:dev`), and no errors occurred?
Mini builds:
- `qc-duplicate-exact-synonym-no-abbrev` related updates - mini build #770
- `qc-duplicate-exact-synonym-no-abbrev` related updates && Remove unused files - mini build 2 #774
- `qc-duplicate-exact-synonym-no-abbrev` related updates && Remove unused files - mini build 3 #776
- `qc-duplicate-exact-synonym-no-abbrev` related updates && Remove unused files - mini build 4 #781
Builds:
- `qc-duplicate-exact-synonym-no-abbrev` related updates - build #771
- `qc-duplicate-exact-synonym-no-abbrev` related updates && Remove unused files - build #773
- `qc-duplicate-exact-synonym-no-abbrev` related updates && Remove unused files - build 3 #780
New Packages
Were any new Python packages added?
Were any other non-Python packages added?
PR Review and Conversations Resolved
Has the PR been sufficiently reviewed by at least 1 team member of the Mondo Technical team and all threads resolved?
Additional notes
`qc-duplicate-exact-synonym-no-abbrev` failures in `mondo` #751