
[FEAT] Include the option to retain the __splink__blocked_id_pairs table #2595

Open
fscholes opened this issue Jan 29, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@fscholes

Is your proposal related to a problem?

For our use case, it would be useful to be able to track all records that were included in the blocking phase, even if they do not get linked to anything during prediction. At the moment, there is no definitive way to store this. We've noticed that Splink automatically generates the blocked_id_pairs table, but then deletes it from the table cache after running predictions. Due to a quirk of how Databricks handles large tables in our case (they are stored in Unity Catalog and are not deleted after the run), the underlying table still exists, but with the randomised strings in its name it is difficult to identify which table belonged to which run.
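To illustrate, this is roughly what we end up doing to hunt for the table afterwards (a sketch only; the catalog and schema names below are placeholders, and the LIKE pattern assumes the intermediate table name still contains "blocked_id_pairs"):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# List tables in the schema Splink writes to and keep those whose (randomised)
# names still contain "blocked_id_pairs", then guess which run produced them.
candidates = (
    spark.sql("SHOW TABLES IN my_catalog.my_schema")  # placeholder catalog/schema
    .filter("tableName LIKE '%blocked_id_pairs%'")
)
candidates.show(truncate=False)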

Describe the solution you'd like

It would be useful if we could have a flag in the predict method that stops Splink from deleting __splink__blocked_id_pairs, or, alternatively, a way to keep the generated table name in the cache so that we can retrieve it from Unity Catalog and use it for analysis. By default it should retain its current behaviour. For example, something like the sketch below.
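(The flag name here is purely illustrative and does not currently exist in Splink.)

# Hypothetical: a flag on predict() that keeps __splink__blocked_id_pairs in the
# cache instead of dropping it once predictions have been computed.
df_predictions = linker.inference.predict(
    threshold_match_probability=0.9,
    retain_blocked_id_pairs=True,  # hypothetical flag proposed in this issue
)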

Describe alternatives you've considered

We have considered using the deterministic_link function included in inference, but it seems like a waste of computation since we only need the unique id pairs that have already been generated by Splink. We currently have a workaround using some Databricks workflow features, but it's not perfect.

Additional context

This might be a highly specialised use case involving Databricks, so I appreciate it may not be feasible.

@fscholes fscholes added the enhancement New feature or request label Jan 29, 2025
@RobinL
Member

RobinL commented Jan 30, 2025

I just had a look at this. The first thing I tried was:

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    retain_intermediate_calculation_columns=False,
    retain_matching_columns=False,
)

linker = Linker(df, settings, db_api)

linker.inference.deterministic_link().as_duckdbpyrelation()

I was expecting this to work, because the two retain_ settings are set to False. However, it does not (I think you already found that).

However, if I remove the comparisons, I think it now gives you what you want:

from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    retain_intermediate_calculation_columns=False,
    retain_matching_columns=False,
)

linker = Linker(df, settings, db_api)

linker.inference.deterministic_link().as_duckdbpyrelation()

Which gives:

┌─────────────┬─────────────┬───────────┐
│ unique_id_l │ unique_id_r │ match_key │
│    int64    │    int64    │  varchar  │
├─────────────┼─────────────┼───────────┤
│           0 │           3 │ 0         │
│           1 │           3 │ 0         │
│           4 │         596 │ 0         │
│           5 │         596 │ 0         │
│           6 │         820 │ 0         │
│           9 │         922 │ 0         │
│          11 │         206 │ 0         │
│          13 │         206 │ 0         │
│          14 │         999 │ 0         │
│          18 │         475 │ 0         │
│           · │          ·  │ ·         │
│           · │          ·  │ ·         │
│           · │          ·  │ ·         │
│         719 │         724 │ 1         │
│         720 │         724 │ 1         │
│         723 │         724 │ 1         │
│         137 │         365 │ 1         │
│         138 │         365 │ 1         │
│         139 │         365 │ 1         │
│         140 │         365 │ 1         │
│         141 │         365 │ 1         │
│         171 │         174 │ 1         │
│         173 │         174 │ 1         │
├─────────────┴─────────────┴───────────┤
│ 3349 rows (20 shown)        3 columns │
└───────────────────────────────────────┘

Assuming this works at your end, the way I'd suggest addressing this is a PR that ensures deterministic_link obeys the retain_ settings. Does that sound sensible to you?

I'm a bit less keen (but not dead set against) on adding a setting that causes the blocked_id_pairs to be retained, simply because it's quite hard to explain!
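In the meantime, if you just need the pairs to persist beyond Splink's table cache, a minimal sketch (assuming SplinkDataFrame.as_pandas_dataframe() is available, as in recent Splink versions) would be:

# Run the deterministic link and persist the id pairs outside Splink's cache,
# e.g. as parquet (any other sink would work equally well).
pairs = linker.inference.deterministic_link()
pairs.as_pandas_dataframe().to_parquet("blocked_id_pairs.parquet")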
