MatchMakeR: Heursitc index-based record linkage in R

MatchMakeR is an experimental first step implementation of an index-based heuristic record linkage method. This package reimplements the main ideas from Thorsten Doherr's search engine, originally written in FoxPro. The aim of the package is to link large-scale data by fuzzy criteria such as names and addresses using a word/token-driven heuristic. This approach provides an efficient candidate retrieval mechanism, replacing traditional blocking strategies. Reimplementing this in R makes it easier to use the approach in existing data pipelines.

Features

Efficient Candidate Retrieval: Quickly find potential matches using a word/token-driven heuristic.
Flexible Text Normalization: Standardize text data for better comparison.
Advanced Tokenization: Create tokens using word and n-gram tokenizers.
Phonetic Encoding: Enhance matching accuracy with Metaphone and Soundex encoding.
Duplicate Detection: Identify and group likely duplicate records within a dataset.
Scalable Processing: Handle large datasets with chunk processing capabilities.

Installation

Install MatchMakeR directly from GitHub using devtools:

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install TidyLLM from GitHub
devtools::install_github("edubruell/MatchMakeR")

Basic Usage Example

Below is an example of how to use the MatchMakeR package for detecting duplicates and searching for matching candidates.

library(MatchMakeR)

# Load base and target tables
yp_base <- readstata13::read.dta13("yellow_pages_hausaerzte.dta")) |> as.data.table()
base_table <- copy(yp_base[year == 2016 & entry_line == 1][, key_base := entry])
target_table <- copy(yp_base[year == 2017 & entry_line == 1][, key_target := entry])

# Define normalization and tokenization strategy
preparers <- search_preparers(
  Nachname ~ normalize_text + word_tokens(.min_length = 3),
  Vorname ~ normalize_text + word_tokens(.min_length = 3),
  Strasse ~ normalize_text + word_tokens,
  Hausnummer ~ normalize_text + word_tokens,
  Ort ~ normalize_text + word_tokens
)

# Prepare search data
search_table_base <- preapare_search_data(preparers, base_table, "key_base")
search_table_target <- preapare_search_data(preparers, target_table, "key_target")

# Detect duplicates within the base table
likely_duplicates <- detect_duplicates(
  .base_table = search_table_base,
  .base_key = "key_base",
  .threshold = 0.8
)

# Deduplicate the base table
deduplicated_base_table <- deduplicate_table(base_table, likely_duplicates, "key_base")

# Search for matching candidates between base and target tables (Please deduplicate and inspect before doing this)
candidates <- search_candidates(
  .base_table = search_table_base,
  .target_table = search_table_target,
  .base_key = "key_base",
  .target_key = "key_target",
  .threshold = 0.6,
  .weights = c(Hausnummer = 0.1, Nachname = 0.5, Vorname = 0.2, Strasse = 0.1, Ort = 0.1),
  .chunksize = 10000
)

Functions Overview

normalize_text(): Normalize a text string by converting to uppercase, transliterating special characters, retaining only alphanumeric characters and spaces, and removing extra spaces.
as_metaphone(): Convert a text string to its Metaphone encoding.
as_soundex(): Convert a text string to its Soundex encoding.
word_tokens(): Return a list of word tokens for the input text.
use_dictionary(): Group similar tokens together using a pre-defined dictionary.
generate_ngrams(): Generate n-gram tokens from a text string.
search_preparers(): Create a list of preparer functions based on a formula syntax.
preapare_search_data(): Apply the preparer functions to the specified columns in the input data frame to create a search table
search_candidates(): Search for matching candidates between a base table and a target table using token-based heuristic linkage.
detect_duplicates(): Detect duplicate records within a base table using token-based heuristic linkage.
deduplicate_table(): Use the results of detect_duplicates to remove duplicate records from a data table.
build_similarity_dict(): Build a dicionary of similar tokens for a base and a target table.

TODO

Add code to smooth rIP in the search functions (softmax or log rIP).
Add feedback mechanism discusses in original search engine

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
R		R
man		man
DESCRIPTION		DESCRIPTION
MatchMakeR.Rproj		MatchMakeR.Rproj
NAMESPACE		NAMESPACE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MatchMakeR: Heursitc index-based record linkage in R

Features

Installation

Basic Usage Example

Functions Overview

TODO

About

Releases

Packages

Languages

edubruell/MatchMakeR

Folders and files

Latest commit

History

Repository files navigation

MatchMakeR: Heursitc index-based record linkage in R

Features

Installation

Basic Usage Example

Functions Overview

TODO

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages