[FEAT] Include the option to retain the __splink__blocked_id_pairs table #2595
Comments
I just had a look at this. The first thing I tried was:

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()
df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    retain_intermediate_calculation_columns=False,
    retain_matching_columns=False,
)

linker = Linker(df, settings, db_api)
linker.inference.deterministic_link().as_duckdbpyrelation()
```

I was expecting this to work, because the two […]. However, if I remove the comparisons, I think it now gives you what you want:
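Here is how I read that variant, as a sketch (assuming SettingsCreator accepts an empty comparisons list when only deterministic_link is used):

```python
# Sketch of the variant without comparisons. Assumption: SettingsCreator
# accepts an empty comparisons list when only deterministic_link is used.
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api)

# One row per blocked pair that satisfies the deterministic rules.
linker.inference.deterministic_link().as_duckdbpyrelation()
```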
Which gives:
Assuming this works at your end, I think the way I might suggest addressing this is that we do a PR that ensures that […]. I'm a bit less keen (but not dead set against) on adding a setting that causes the blocked_id_pairs to be retained, simply because it's quite hard to explain!
Is your proposal related to a problem?
For our use case, it would be useful to be able to track all records that were included in the blocking phase, even if they do not get linked to anything during prediction. At the moment, there is no definitive way to store this. Due to a quirk, perhaps of how big the blocked_id_pairs table is in our use case and of how Databricks handles big tables (it stores them in Unity Catalog, and they are not deleted after the run), we've noticed that Splink automatically generates this table but deletes it from the table cache after running predictions. Because the generated table names contain randomised strings, it is difficult to identify which table belonged to which run.
Describe the solution you'd like
It would be useful to have a flag in the predict method that stops Splink from deleting `__splink__blocked_id_pairs`, or, alternatively, a way to keep the generated table name in the cache so that we can retrieve it from Unity Catalog and use it for analysis. By default, it should retain its current behaviour; the sketch after this paragraph illustrates the idea.
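For illustration only, a sketch of the kind of opt-in flag being requested; the `keep_blocked_id_pairs` parameter is hypothetical and does not exist in Splink's API:

```python
# Hypothetical sketch only: keep_blocked_id_pairs is NOT a real Splink
# parameter; it illustrates the requested opt-in behaviour.
df_predict = linker.inference.predict(
    threshold_match_probability=0.9,
    keep_blocked_id_pairs=True,  # hypothetical: don't drop the pairs table
)
```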
Describe alternatives you've considered
We have considered using the deterministic link function included in inference (see the sketch below), but it seems like a waste of computation, since we only need the unique id pairs that have already been generated by Splink. We currently have a workaround using some Databricks workflow features, but it's not perfect.
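For concreteness, a sketch of that alternative, assuming the default unique id column naming so the pair columns are `unique_id_l` / `unique_id_r`:

```python
# Sketch of the deterministic_link alternative: recompute the pairs and
# keep only the id columns (assumes default unique_id_l / unique_id_r names).
pairs = linker.inference.deterministic_link().as_duckdbpyrelation()
id_pairs = pairs.project("unique_id_l, unique_id_r")
print(id_pairs.limit(5))
```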
Additional context
This might be a highly specialised use case involving Databricks, so I understand if it's not feasible.