
[FEAT] Include the option to retain the __splink__blocked_id_pairs table #2595

Open
fscholes opened this issue Jan 29, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@fscholes

Is your proposal related to a problem?

For our use case, it would be useful to be able to track all records that were included in the blocking phase, even if they do not get linked to anything during prediction. At the moment, there is no definitive way to store this. We've noticed that Splink automatically generates the blocked_id_pairs table, but then deletes it from the table cache after running predictions. Due to a quirk of how Databricks handles large tables in our case (they are stored in Unity Catalog and are not deleted after the run), the underlying table still exists, but with the randomised strings in its name it is difficult to identify which table belonged to which run.
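To illustrate, this is roughly what we end up doing to hunt for the table afterwards (a sketch only; the catalog and schema names below are placeholders, and the LIKE pattern assumes the intermediate table name still contains "blocked_id_pairs"):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# List tables in the schema Splink writes to and keep those whose (randomised)
# names still contain "blocked_id_pairs", then guess which run produced them.
candidates = (
    spark.sql("SHOW TABLES IN my_catalog.my_schema")  # placeholder catalog/schema
    .filter("tableName LIKE '%blocked_id_pairs%'")
)
candidates.show(truncate=False)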

Describe the solution you'd like

It would be useful if we could have a flag in the predict method that stops Splink from deleting __splink__blocked_id_pairs, or, alternatively, a way to keep the generated table name in the cache so that we can retrieve it from Unity Catalog and use it for analysis. By default it should retain its current behaviour. For example, something like the sketch below.
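(The flag name here is purely illustrative and does not currently exist in Splink.)

# Hypothetical: a flag on predict() that keeps __splink__blocked_id_pairs in the
# cache instead of dropping it once predictions have been computed.
df_predictions = linker.inference.predict(
    threshold_match_probability=0.9,
    retain_blocked_id_pairs=True,  # hypothetical flag proposed in this issue
)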

Describe alternatives you've considered

We have considered using the deterministic_link function included in inference, but it seems like a waste of computation since we only need the unique id pairs that have already been generated by Splink. We currently have a workaround using some Databricks workflow features, but it's not perfect.

Additional context

This might be a highly specialised use case involving Databricks, so I appreciate it may not be feasible.

@fscholes fscholes added the enhancement New feature or request label Jan 29, 2025
@RobinL
Member

RobinL commented Jan 30, 2025

I just had a look at this. The first thing I tried was:

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    retain_intermediate_calculation_columns=False,
    retain_matching_columns=False,
)

linker = Linker(df, settings, db_api)

linker.inference.deterministic_link().as_duckdbpyrelation()

I was expecting this to work, because the two retain_ settings are set to False. However, it does not (I think you already found that).

However, if I remove the comparisons, I think it now gives you what you want:

from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    retain_intermediate_calculation_columns=False,
    retain_matching_columns=False,
)

linker = Linker(df, settings, db_api)

linker.inference.deterministic_link().as_duckdbpyrelation()

Which gives:

┌─────────────┬─────────────┬───────────┐
│ unique_id_l │ unique_id_r │ match_key │
│    int64    │    int64    │  varchar  │
├─────────────┼─────────────┼───────────┤
│           0 │           3 │ 0         │
│           1 │           3 │ 0         │
│           4 │         596 │ 0         │
│           5 │         596 │ 0         │
│           6 │         820 │ 0         │
│           9 │         922 │ 0         │
│          11 │         206 │ 0         │
│          13 │         206 │ 0         │
│          14 │         999 │ 0         │
│          18 │         475 │ 0         │
│           · │          ·  │ ·         │
│           · │          ·  │ ·         │
│           · │          ·  │ ·         │
│         719 │         724 │ 1         │
│         720 │         724 │ 1         │
│         723 │         724 │ 1         │
│         137 │         365 │ 1         │
│         138 │         365 │ 1         │
│         139 │         365 │ 1         │
│         140 │         365 │ 1         │
│         141 │         365 │ 1         │
│         171 │         174 │ 1         │
│         173 │         174 │ 1         │
├─────────────┴─────────────┴───────────┤
│ 3349 rows (20 shown)        3 columns │
└───────────────────────────────────────┘

Assuming this works at your end, the way I'd suggest addressing this is a PR that ensures deterministic_link obeys the retain_ settings. Does that sound sensible to you?

I'm a bit less keen (but not dead set against) on adding a setting that causes the blocked_id_pairs to be retained, simply because it's quite hard to explain!
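In the meantime, if you just need the pairs to persist beyond Splink's table cache, a minimal sketch (assuming SplinkDataFrame.as_pandas_dataframe() is available, as in recent Splink versions) would be:

# Run the deterministic link and persist the id pairs outside Splink's cache,
# e.g. as parquet (any other sink would work equally well).
pairs = linker.inference.deterministic_link()
pairs.as_pandas_dataframe().to_parquet("blocked_id_pairs.parquet")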
