Experiment: switch to link by log odds window #229
base: main
Conversation
      0.8,
      0.925
    ],
    "maximum_points": 13.2001963043,
Does this need to be specified? We know the log odds values and we know the evaluators in use for a particular pass, so it seems like the algorithm should be able to calculate this itself.
Great catch; no, we don't need to define this here at all. I just started in this file and then changed some of the structure later, so I can go back and compute this calculation on the fly.
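The on-the-fly calculation suggested above could be a one-liner: sum the log-odds weights of every field the pass evaluates. A minimal sketch, assuming a simple dict-based config (the function name, field names, and weights are illustrative, not the real algorithm configuration):

```python
def max_points(log_odds: dict[str, float], evaluated_fields: list[str]) -> float:
    """Best possible score for a pass: each evaluator contributes at most
    its full log-odds weight, so the maximum is just the sum of weights
    for the fields that pass evaluates."""
    return sum(log_odds[field] for field in evaluated_fields)


# Illustrative weights only; real values would come from the trained log odds.
log_odds = {"FIRST_NAME": 6.85, "LAST_NAME": 6.35}
maximum = max_points(log_odds, ["FIRST_NAME", "LAST_NAME"])
```

With this in place, "maximum_points" would no longer need to be stored in the config file, removing one source of drift between the stored value and the weights.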
@@ -59,6 +62,8 @@ def get_block_data(
    )

    # Using the subquery of unique Patient IDs, select all the Patients
    # NOTE: We probably apply the filter here to throw away patients with
    # non-empty but wrong blocking fields?
What does wrong mean in this context? The join clause will exclude any patient record that doesn't create an exact match.
Ah, this is something unrelated to log odds windows that I was just sticking in as a comment. It builds on the point I made yesterday: because we're no longer calculating belongingness ratio, we don't have to fetch all the patients associated with a Person cluster anymore, even the ones that didn't meet the blocking criterion. Marcelle correctly observed that if we pulled back people from the Person cluster who failed blocking conditions only because they had missing values, we would solve half of the missingness problem. This comment is in service of that: it seemed like all the patients in each blocked Person got pulled back, so we'd then want to filter that set down to only the patients who exact-matched blocking, plus patients from blocked Persons who didn't block because their fields were missing.
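The filtering rule described above (keep exact matches and missing values, discard non-empty mismatches) could be sketched as a simple predicate. This is a hypothetical illustration, not code from the PR; the dict-based record shape and `blocking_keys` parameter are assumptions for the example:

```python
def keep_patient(incoming: dict, candidate: dict, blocking_keys: list[str]) -> bool:
    """Decide whether a patient pulled back from a blocked Person cluster
    should survive filtering, per the rule discussed in this thread."""
    for key in blocking_keys:
        value = candidate.get(key)
        if value is None or value == "":
            # Missing blocking field: keep the patient, so missingness
            # alone never excludes a candidate.
            continue
        if value != incoming.get(key):
            # Non-empty but wrong blocking field: throw the patient away.
            return False
    return True
```

For example, a candidate with an empty ZIP survives, while one with a conflicting ZIP does not.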
src/recordlinker/linking/link.py
Outdated
 def compare(
     record: schemas.PIIRecord, patient: models.Patient, algorithm_pass: models.AlgorithmPass
-) -> bool:
+) -> typing.Union[bool, float]:
I think this has the potential to make things very messy downstream, and we need to make a decision here about what to concretely return. When comparing one record to another, does our downstream process need:
- a summation of points (float)
- a decision on whether the total is over a threshold (bool)
- both the summation of points and a decision on whether it was over (tuple[float, bool])

It will be much easier to program the rest of the logic if we always get the same payload back.
I completely agree here. I think I was focusing too hard on preserving backwards compatibility, because we can still get "auto-matching" / non-possible-match mode by setting the window bounds to be equal. I would propose that this function return the summation of log odds (and maybe we rename it compare_and_score to emphasize that) and then let the broader link caller handle that sum. The comparison proper isn't actually affected by user thresholds, so it doesn't care about normalization or whatever bounds they set.
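The proposed contract might look like the following sketch: the comparison always returns the raw log-odds sum as a float, and thresholding lives entirely in the caller. The name compare_and_score comes from the comment above; the dict-based signature and exact-equality evaluator are simplifying assumptions for illustration:

```python
def compare_and_score(record: dict, patient: dict, weights: dict[str, float]) -> float:
    """Return the summed log-odds points for the fields that agree.

    Always a float: no booleans, no unions, so every downstream caller
    gets the same payload back regardless of algorithm configuration.
    """
    return sum(
        weight
        for field, weight in weights.items()
        if record.get(field) is not None and record.get(field) == patient.get(field)
    )
```

The caller can then normalize the sum against the maximum points for the pass and apply the user-facing window, keeping threshold logic out of the comparison itself.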
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #229      +/-   ##
==========================================
- Coverage   97.69%   97.45%   -0.25%
==========================================
  Files          32       32
  Lines        1651     1691      +40
==========================================
+ Hits         1613     1648      +35
- Misses         38       43       +5

☔ View full report in Codecov by Sentry.
@ericbuckley @m-goggins Updated code to make all tests pass (there's still one set of three tests involving belongingness that I'm not quite sure what to do with, but I will continue noodling). I also ran the
Description
Experimental branch showing how the guts of the code might change when switching from a windowed belongingness to a windowed log odds, with Dan's suggestion to treat the whole thing like a medical test result with interpretation layered on top (i.e., normalization pushed as far up the pipeline as possible, so all values the user deals with are between 0 and 1). The code is by no means PR-ready or finalized (e.g., tests are not handled, updates to schemas.Prediction and schemas.LinkResult are not fully changed, etc.), but I wanted to share the ideas.
NOTE: I think we should decouple the log odds work from missing fields or changes to blocking, to avoid scope creep and pushing this feature farther out. This milestone should just be about replacing belongingness with log odds.
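The "medical test result" framing in the description could be sketched as a small interpretation layer: normalize the raw log-odds sum to [0, 1], then map it through a user-facing window. The window bounds 0.8 and 0.925 are taken from the config snippet in this PR; the function name and result labels are illustrative:

```python
def interpret(points: float, maximum_points: float,
              lower: float = 0.8, upper: float = 0.925) -> str:
    """Normalize a raw log-odds sum and interpret it against a window.

    Normalization happens here, at the top of the pipeline, so the only
    values the user ever configures or sees are between 0 and 1.
    """
    score = points / maximum_points  # normalized to [0, 1]
    if score >= upper:
        return "match"
    if score >= lower:
        return "possible-match"
    return "no-match"
```

Note that setting lower equal to upper collapses the possible-match band, recovering the "auto-matching" mode discussed in the review thread above.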
Related Issues
[Link any related issues or tasks from your project management system.]
Additional Notes
[Add any additional context or notes that reviewers should know about.]
<--------------------- REMOVE THE LINES BELOW BEFORE MERGING --------------------->
Checklist
Please review and complete the following checklist before submitting your pull request:
Checklist for Reviewers
Please review and complete the following checklist during the review process: