Experiment: switch to link by log odds window #229
base: main
Conversation
      0.8,
      0.925
    ],
    "maximum_points": 13.2001963043,
Does this need to be specified? We know the log odds values and we know the evaluators in use for a particular pass, so it seems like the algorithm should be able to calculate this itself.
Great catch; no, we don't need to define this here at all. I just started in this file and then changed some of the structure later, so I can go back and compute this calculation on the fly.
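The on-the-fly calculation suggested above could be a one-liner: sum the log-odds weights of every field the pass evaluates. A minimal sketch, assuming a simple dict-based config (the function name, field names, and weights are illustrative, not the real algorithm configuration):

```python
def max_points(log_odds: dict[str, float], evaluated_fields: list[str]) -> float:
    """Best possible score for a pass: each evaluator contributes at most
    its full log-odds weight, so the maximum is just the sum of weights
    for the fields that pass evaluates."""
    return sum(log_odds[field] for field in evaluated_fields)


# Illustrative weights only; real values would come from the trained log odds.
log_odds = {"FIRST_NAME": 6.85, "LAST_NAME": 6.35}
maximum = max_points(log_odds, ["FIRST_NAME", "LAST_NAME"])
```

With this in place, "maximum_points" would no longer need to be stored in the config file, removing one source of drift between the stored value and the weights.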
@@ -59,6 +62,8 @@ def get_block_data(
    )

    # Using the subquery of unique Patient IDs, select all the Patients
    # NOTE: We probably apply the filter here to throw away patients with
    # non-empty but wrong blocking fields?
What does wrong mean in this context? The join clause will exclude any patient record that doesn't create an exact match.
Ah, this is something unrelated to log odds windows that I was just sticking in as a comment. It builds on the point I made yesterday: because we're no longer calculating belongingness ratio, we don't have to fetch all the patients associated with a Person cluster anymore, even the ones that didn't meet the blocking criterion. Marcelle correctly observed that if we pulled back people from the Person cluster who failed blocking conditions only because they had missing values, we would solve half of the missingness problem. This comment is in service of that: it seemed like all the patients in each blocked Person got pulled back, so we'd then want to filter that set down to only the patients who exact-matched blocking, plus patients from blocked Persons who didn't block because their fields were missing.
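The filtering rule described above (keep exact matches and missing values, discard non-empty mismatches) could be sketched as a simple predicate. This is a hypothetical illustration, not code from the PR; the dict-based record shape and `blocking_keys` parameter are assumptions for the example:

```python
def keep_patient(incoming: dict, candidate: dict, blocking_keys: list[str]) -> bool:
    """Decide whether a patient pulled back from a blocked Person cluster
    should survive filtering, per the rule discussed in this thread."""
    for key in blocking_keys:
        value = candidate.get(key)
        if value is None or value == "":
            # Missing blocking field: keep the patient, so missingness
            # alone never excludes a candidate.
            continue
        if value != incoming.get(key):
            # Non-empty but wrong blocking field: throw the patient away.
            return False
    return True
```

For example, a candidate with an empty ZIP survives, while one with a conflicting ZIP does not.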
src/recordlinker/linking/link.py
Outdated
 def compare(
     record: schemas.PIIRecord, patient: models.Patient, algorithm_pass: models.AlgorithmPass
-) -> bool:
+) -> typing.Union[bool, float]:
I think this has the potential to make things very messy downstream, and we need to make a decision here about what to concretely return. When comparing one record to another, does our downstream process need:
- a summation of points (float)
- a decision on whether the total is over a threshold (bool)
- both the summation of points and a decision on whether it was over (tuple[float, bool])

It will be much easier to program the rest of the logic if we always get the same payload back.
I completely agree here. I think I was focusing too hard on preserving backwards compatibility, because we can still get "auto-matching" / non-possible-match mode by setting the window bounds to be equal. I would propose that this function return the summation of log odds (and maybe we rename it compare_and_score to emphasize that) and then let the broader link caller handle that sum. The comparison proper isn't actually affected by user thresholds, so it doesn't care about normalization or whatever bounds they set.
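The proposed contract might look like the following sketch: the comparison always returns the raw log-odds sum as a float, and thresholding lives entirely in the caller. The name compare_and_score comes from the comment above; the dict-based signature and exact-equality evaluator are simplifying assumptions for illustration:

```python
def compare_and_score(record: dict, patient: dict, weights: dict[str, float]) -> float:
    """Return the summed log-odds points for the fields that agree.

    Always a float: no booleans, no unions, so every downstream caller
    gets the same payload back regardless of algorithm configuration.
    """
    return sum(
        weight
        for field, weight in weights.items()
        if record.get(field) is not None and record.get(field) == patient.get(field)
    )
```

The caller can then normalize the sum against the maximum points for the pass and apply the user-facing window, keeping threshold logic out of the comparison itself.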
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #229      +/-   ##
==========================================
- Coverage   97.69%   97.45%   -0.25%
==========================================
  Files          32       32
  Lines        1651     1691      +40
==========================================
+ Hits         1613     1648      +35
- Misses         38       43       +5

☔ View full report in Codecov by Sentry.
@ericbuckley @m-goggins Updated code to make all tests pass (there's still one set of three tests involving belongingness that I'm not quite sure what to do with, but I will continue noodling). I also ran the
Description
Experimental branch showing how the guts of the code might change when switching from a windowed belongingness to a windowed log odds, with Dan's suggestion to treat the whole thing like a medical test result with interpretation layered on top (i.e., normalization pushed as far up the pipeline as possible, so all values the user deals with are between 0 and 1). The code is by no means PR-ready or finalized (e.g., tests are not handled, updates to schemas.Prediction and schemas.LinkResult are not fully changed, etc.), but I wanted to share the ideas.
NOTE: I think we should decouple the log odds work from missing fields or changes to blocking, to avoid scope creep and pushing this feature farther out. This milestone should just be about replacing belongingness with log odds.
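The "medical test result" framing in the description could be sketched as a small interpretation layer: normalize the raw log-odds sum to [0, 1], then map it through a user-facing window. The window bounds 0.8 and 0.925 are taken from the config snippet in this PR; the function name and result labels are illustrative:

```python
def interpret(points: float, maximum_points: float,
              lower: float = 0.8, upper: float = 0.925) -> str:
    """Normalize a raw log-odds sum and interpret it against a window.

    Normalization happens here, at the top of the pipeline, so the only
    values the user ever configures or sees are between 0 and 1.
    """
    score = points / maximum_points  # normalized to [0, 1]
    if score >= upper:
        return "match"
    if score >= lower:
        return "possible-match"
    return "no-match"
```

Note that setting lower equal to upper collapses the possible-match band, recovering the "auto-matching" mode discussed in the review thread above.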
Related Issues
[Link any related issues or tasks from your project management system.]
Additional Notes
[Add any additional context or notes that reviewers should know about.]
<--------------------- REMOVE THE LINES BELOW BEFORE MERGING --------------------->
Checklist
Please review and complete the following checklist before submitting your pull request:
Checklist for Reviewers
Please review and complete the following checklist during the review process: