Commit 5cf6c2e

add news copy benchmark

1 parent c6e5787

File tree: 9 files changed (+243, -20 lines)

Makefile

Lines changed: 2 additions & 1 deletion

```diff
@@ -25,7 +25,8 @@ test: up
 	docker compose cp local:/app/cobertura.xml cobertura.xml
 
 benchmark: up
-	docker compose exec local poetry run python tests/test_benchmark.py
+	docker compose exec local poetry run python tests/test_benchmark_core.py
+	docker compose exec local poetry run python tests/test_benchmark_news.py
 
 spark_test: up
 	docker compose exec local poetry run pytest -vvv -s --doctest-modules tests/test_minhash_spark.py
```

README.md

Lines changed: 51 additions & 17 deletions

```diff
@@ -220,25 +220,59 @@ INFO After : 47045
 
 ## Benchmarks
 
-A script is provided to benchmark some of the algorithms on `pinecone/core-2020-05-10-deduplication` can be found in `tests/test_benchmark.py`:
-
-| Algorithm                    | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score | Accuracy   | Time    |
-| :--------------------------- | ---------------------: | ------------------: | -------------------------: | ----------------------: | -------------: | ---------: | :------ |
-| MinHash Spark                | 0.957 | 0.9445 | 0.9471 | 0.959 | **0.952** | **0.9202** | 698.76s |
-| MinHash                      | 0.9594 | 0.9445 | 0.9474 | 0.9616 | **0.9534** | **0.924** | 18.80s |
-| SimHash                      | 0.9007 | 0.6786 | 0.7681 | 0.9343 | 0.8344 | 0.8137 | 253.94s |
-| Exact Title                  | 0.8302 | 0.5521 | 0.7098 | 0.9065 | 0.77 | 0.7456 | - |
-| Exact Title Matching *       | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
-| Simhash Matching *           | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
-| Document Vector Similarity * | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
-| Hybrid Method *              | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
-
-\* [(Gyawali et al., LREC 2020)](https://aclanthology.org/2020.lrec-1.113)
-
-\*\* Best SimHash result from `benchmarks/hyperparameter.ipynb`
+### pinecone/core-2020-05-10-deduplication
+
+See `tests/test_benchmark_core.py` for reproduction.
+
+| Algorithm                       | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score | Accuracy   | Time    |
+| :------------------------------ | ---------------------: | ------------------: | -------------------------: | ----------------------: | -------------: | ---------: | :------ |
+| MinHash Spark                   | 0.957 | 0.9445 | 0.9471 | 0.959 | **0.952** | **0.9202** | 698.76s |
+| MinHash                         | 0.9594 | 0.9445 | 0.9474 | 0.9616 | **0.9534** | **0.924** | 18.80s |
+| SimHash**                       | 0.9007 | 0.6786 | 0.7681 | 0.9343 | 0.8344 | 0.8137 | 253.94s |
+| Exact Title                     | 0.8302 | 0.5521 | 0.7098 | 0.9065 | 0.77 | 0.7456 | - |
+| Exact Title Matching [^1]       | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
+| Simhash Matching [^1]           | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
+| Document Vector Similarity [^1] | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
+| Hybrid Method [^1]              | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
+| LaBSE [^2]                      | 0.937 | 0.923 | 0.930 | 0.943 | 0.933 | 0.919 | - |
+| Multilingual USE [^2]           | 0.917 | 0.907 | 0.918 | 0.927 | 0.917 | 0.909 | - |
+| Multilingual E5-Base [^2]       | 0.931 | 0.908 | 0.919 | 0.939 | 0.924 | 0.920 | - |
+| MinHash + LSH [^2]              | 0.929 | 0.902 | 0.915 | 0.938 | 0.921 | 0.918 | - |
+| RETSim Partial-Dup [^2]         | 0.945 | 0.941 | 0.945 | 0.949 | 0.945 | **0.928** | - |
+| RETSim Near-Dup [^2]            | 0.928 | 0.937 | 0.942 | 0.934 | 0.935 | **0.926** | - |
+
+### NEWS-COPY
+
+See `tests/test_benchmark_news.py` for reproduction.
+
+Adjusted Rand Index (ARI) on the NEWS-COPY dataset:
+
+| Model/Algorithm           | ARI       |
+| :------------------------ | :-------- |
+| n-gram [^3]               | 0.440     |
+| SimHash                   | 0.612     |
+| SimHash [^2]              | 0.695     |
+| MinHash                   | 0.742     |
+| MinHash [^3]              | 0.737     |
+| MinHash [^2]              | 0.783     |
+| Multilingual USE [^2]     | 0.730     |
+| Multilingual E5-Base [^2] | 0.742     |
+| S-BERT [^3]               | 0.700     |
+| RETSim Partial-Dup [^2]   | 0.831     |
+| RETSim Near-Dup [^2]      | 0.704     |
+| Re-ranking [^3]           | **0.937** |
+| Bi-encoder [^3]           | 0.915     |
+
+[^1]: [(Gyawali et al., LREC 2020)](https://aclanthology.org/2020.lrec-1.113)
+[^2]: [RETSim: Resilient and Efficient Text Similarity](https://arxiv.org/abs/2311.17264)
+[^3]: [Noise-Robust De-Duplication at Scale](https://www.semanticscholar.org/paper/Noise-Robust-De-Duplication-at-Scale-Silcock-D'Amico-Wong/7ca41cc5fc364b713aba5b573ae4ada801fd788a)
 
 > [!note]
-> Spark implementation has some overhead for small datasets, so I recommend using the script only when you have a large dataset and enough compute resources.
+> 1. Best SimHash result from `benchmarks/hyperparameter.ipynb`
+> 2. The Spark implementation has some overhead for small datasets, so I recommend using the script only when you have a large dataset and enough compute resources.
 
 <!-- ## FAQ
```
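For context on the metric used in the new table: ARI scores a predicted clustering against ground-truth cluster assignments and is invariant to how clusters are labeled; 1.0 means identical clusterings and values near 0 mean chance-level agreement. A minimal sketch with toy data, using the same `sklearn.metrics.adjusted_rand_score` call the news benchmark script relies on:

```python
from sklearn.metrics import adjusted_rand_score

# Toy data: label values are arbitrary ids; only the grouping matters.
ground_truth = [0, 0, 1, 1, 2]  # articles 0-1 and 2-3 are duplicates; 4 is unique
predicted = [5, 5, 7, 7, 7]     # article 4 was wrongly merged into cluster 7

print(adjusted_rand_score(ground_truth, predicted))  # ~0.55; 1.0 = perfect agreement
```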

compose.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -9,3 +9,4 @@ services:
       - ./docs:/app/docs
       - ./tests:/app/tests
       - ./text_dedup:/app/text_dedup
+      - ./data:/app/data
```

poetry.lock

Lines changed: 65 additions & 1 deletion

Some generated files are not rendered by default.

pyproject.toml

Lines changed: 1 addition & 0 deletions

```diff
@@ -35,6 +35,7 @@ pytest = "^8.0.2"
 coverage = "^7.4.3"
 ruff = "^0.3.2"
 tabulate = "^0.9.0"
+scikit-learn = "^1.4.1.post1"
 
 [build-system]
 requires = ["poetry-core"]
```
tests/test_benchmark.py → tests/test_benchmark_core.py

File renamed without changes.

tests/test_benchmark_news.py

Lines changed: 111 additions & 0 deletions

```python
import os
import pickle  # nosec

import click
import datasets
import pandas as pd
from sklearn.metrics import adjusted_rand_score

from text_dedup.minhash import main as minhash_main
from text_dedup.simhash import main as simhash_main
from text_dedup.utils import IOArgs
from text_dedup.utils import MetaArgs
from text_dedup.utils import MinHashArgs
from text_dedup.utils import SimHashArgs
from text_dedup.utils.preprocessing import news_copy_preprocessing
from text_dedup.utils.timer import Timer
from text_dedup.utils.union_find import UnionFind

NUM_PROC = os.cpu_count()


def prepare_data(data_path, label_path, output_path_ds, output_path_spark):
    """Preprocess the NEWS-COPY articles, save them as a HF dataset and as
    parquet (for Spark), and return ground-truth cluster ids built from the
    labeled duplicate pairs via union-find."""
    df = pd.read_json(data_path).T.reset_index()
    labels = pd.read_json(label_path)
    id2data = []
    filename2id = {}
    uf = UnionFind()

    for i, row in df.iterrows():
        id2data.append(
            {
                "filename": str(row["id"]),
                "headline": news_copy_preprocessing(str(row["headline"])),
                "text": news_copy_preprocessing(str(row["headline"] + " " + row["article"])),
                "article": news_copy_preprocessing(str(row["article"])),
                "id": int(i),
            }
        )
        filename2id[id2data[i]["filename"]] = i

    # Each labeled row is a duplicate pair; union them into clusters.
    for i, row in labels.iterrows():
        uf.union(filename2id[row[0]], filename2id[row[1]])

    clusters = [None for _ in range(len(df))]
    for i in range(len(df)):
        clusters[i] = uf.find(filename2id[id2data[i]["filename"]])

    ds = datasets.Dataset.from_pandas(pd.DataFrame(id2data))
    ds.save_to_disk(output_path_ds)

    os.makedirs(output_path_spark, exist_ok=True)
    pd.DataFrame(id2data).to_parquet(output_path_spark + "/data.parquet")

    return clusters


def uf2results(labels, output_path):
    """Load a pickled UnionFind produced by a dedup run and score its
    clustering against the ground-truth labels with ARI."""
    with open(output_path, "rb") as f:
        uf = pickle.load(f)  # nosec

    predictions = [uf.find(i) for i in range(len(labels))]
    return adjusted_rand_score(labels, predictions)


if __name__ == "__main__":
    t = Timer()

    output_path_ds = "news_input_ds"
    output_path_spark = "news_input_spark"

    test_data = ("./data/test_inf_data.json", "./data/full_test_gt.json")
    val_data = ("./data/1955_inf_data.json", "./data/1955_gt.json")  # validation split, unused here
    labels = prepare_data(*test_data, output_path_ds, output_path_spark)

    io_args = IOArgs(
        path=output_path_ds,
        local=True,
        num_proc=NUM_PROC,
        cache_dir=".cache",
        output="./news_output_minhash",
        debug=True,
        clean_cache=True,
    )
    meta_args = MetaArgs(column="article", batch_size=10000)

    # TODO: hyperparameter tuning
    with t("MinHash"):
        ctx = click.Context(minhash_main)
        minhash_args = MinHashArgs(num_perm=256, ngram=2, min_length=0, threshold=0.45)
        io_args.output = minhash_output = "./news_output_minhash"
        ctx.invoke(
            minhash_main,
            io_args=io_args,
            meta_args=meta_args,
            minhash_args=minhash_args,
        )

    # TODO: hyperparameter tuning
    with t("SimHash"):
        ctx = click.Context(simhash_main)
        simhash_args = SimHashArgs(bit_diff=12, num_bucket=13, ngram=5)
        io_args.output = simhash_output = "./temp_output_simhash"
        ctx.invoke(
            simhash_main,
            io_args=io_args,
            meta_args=meta_args,
            simhash_args=simhash_args,
        )

    print(f"MinHash ARI: {uf2results(labels, f'{minhash_output}/uf.pkl')}")
    print(f"SimHash ARI: {uf2results(labels, f'{simhash_output}/uf.pkl')}")
```
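A note on the invocation style above: instead of shelling out to the CLI, the script calls each deduplication entry point through `click.Context.invoke`, which runs a command's callback directly with keyword arguments. A minimal, self-contained sketch of that pattern (toy command and names, not part of the repo):

```python
import click


@click.command()
@click.option("--name", default="world")
def greet(name: str):
    """Toy command standing in for minhash_main/simhash_main."""
    click.echo(f"hello {name}")


# Build a Context for the command, then invoke its callback directly;
# argv parsing is bypassed, as in the benchmark script.
ctx = click.Context(greet)
ctx.invoke(greet, name="news-copy")  # prints "hello news-copy"
```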

text_dedup/simhash.py

Lines changed: 6 additions & 1 deletion

```diff
@@ -103,7 +103,9 @@ def __init__(self, f: int, k: int, b: int, masks: list[tuple[bitarray, int, int,
 
             self.masks.append(mask)
 
-        assert sum(self.widths) == f, "The sum of block widths must be equal to the fingerprint size"
+        assert (
+            sum(self.widths) == f
+        ), f"The sum of block widths {sum(self.widths)} must be equal to the fingerprint size {f}"
 
         prefix_width = sum(self.widths[: b - k])
         self.search_mask: bitarray = bitarray(f)
@@ -191,9 +193,12 @@ def _create_permutations(f: int, k: int, b: int) -> list[Permutation]:
     """
     block_size: int = math.ceil(f / b)
     masks: list[tuple[bitarray, int, int, int]] = []
+    b = min(b, math.ceil(f / block_size))
 
     for i in range(b):
        start, end = i * block_size, min((i + 1) * block_size, f)
+        if start >= end:
+            break
         mask: bitarray = bitarray(f)
         mask.setall(0)
         mask[start:end] = 1
```
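Why the `_create_permutations` guard is needed: `block_size = math.ceil(f / b)` can cover all `f` fingerprint bits in fewer than `b` blocks, leaving trailing zero-width blocks (e.g. `f=64, b=9` gives `block_size=8`, so the ninth block would start at bit 64). The added clamp and early `break` drop those empty blocks. A standalone sketch of the partitioning logic under that fix (illustrative re-implementation, not the library code):

```python
import math


def block_bounds(f: int, b: int) -> list[tuple[int, int]]:
    """Split f fingerprint bits into at most b non-empty contiguous blocks."""
    block_size = math.ceil(f / b)
    b = min(b, math.ceil(f / block_size))  # drop blocks that would start past bit f
    bounds = []
    for i in range(b):
        start, end = i * block_size, min((i + 1) * block_size, f)
        if start >= end:  # defensive: skip zero-width blocks
            break
        bounds.append((start, end))
    return bounds


# f=64, b=9: block_size=8, so 8 blocks already cover all 64 bits; the 9th is dropped.
print(block_bounds(64, 9))  # [(0, 8), (8, 16), ..., (56, 64)] -> widths sum to 64
```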

text_dedup/utils/preprocessing.py

Lines changed: 6 additions & 0 deletions

```python
def news_copy_preprocessing(text: str) -> str:
    # Normalize OCR'd news text: join hyphenated line breaks, flatten
    # newlines, strip punctuation, and drop non-ASCII characters.
    chars_to_remove = r'"#$%&\()*+/:;<=>@[\\]^_`{|}~.?,!\''
    text = text.replace("-\n", "").replace("\n", " ")
    text = text.translate(str.maketrans("", "", chars_to_remove))
    text = text.encode("ascii", "ignore").decode()
    return text
```
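A quick illustration of what this helper does to OCR-style input (toy example; the expected output is shown in the comment):

```python
from text_dedup.utils.preprocessing import news_copy_preprocessing

text = 'The Presi-\ndent said: "Qué sorpresa!" (reported yesterday)'
print(news_copy_preprocessing(text))
# -> The President said Qu sorpresa reported yesterday
```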
