Commit c6e5787

update readme
1 parent 759e600 commit c6e5787

5 files changed

+228
-127
lines changed

Makefile

Lines changed: 3 additions & 0 deletions
@@ -24,6 +24,9 @@ test: up
 	docker compose exec local poetry run coverage report -m
 	docker compose cp local:/app/cobertura.xml cobertura.xml
 
+benchmark: up
+	docker compose exec local poetry run python tests/test_benchmark.py
+
 spark_test: up
 	docker compose exec local poetry run pytest -vvv -s --doctest-modules tests/test_minhash_spark.py
 
README.md

Lines changed: 15 additions & 12 deletions
@@ -220,22 +220,25 @@ INFO After : 47045
 
 ## Benchmarks
 
-A benchmark of different methods here can be found in `benchmarks/wiki40.ipynb`. A notebook in evaluating MinHash on `pinecone/core-2020-05-10-deduplication` can be found in `benchmarks/pinecone.ipynb`.
+A script is provided to benchmark some of the algorithms on `pinecone/core-2020-05-10-deduplication` can be found in `tests/test_benchmark.py`:
 
-For quick reference, here are the results:
+| Algorithm                    | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score |   Accuracy | Time    |
+| :--------------------------- | ---------------------: | ------------------: | -------------------------: | ----------------------: | -------------: | ---------: | :------ |
+| MinHash Spark                |                  0.957 |              0.9445 |                     0.9471 |                   0.959 |      **0.952** | **0.9202** | 698.76s |
+| MinHash                      |                 0.9594 |              0.9445 |                     0.9474 |                  0.9616 |     **0.9534** |  **0.924** | 18.80s  |
+| SimHash                      |                 0.9007 |              0.6786 |                     0.7681 |                  0.9343 |         0.8344 |     0.8137 | 253.94s |
+| Exact Title                  |                 0.8302 |              0.5521 |                     0.7098 |                  0.9065 |           0.77 |     0.7456 | -       |
+| Exact Title Matching *       |                  0.830 |                0.50 |                      0.709 |                   0.992 |          0.757 |      0.746 | -       |
+| Simhash Matching *           |                  0.697 |               0.247 |                      0.598 |                   0.985 |          0.631 |      0.616 | -       |
+| Document Vector Similarity * |                  0.912 |               0.779 |                      0.861 |                   0.986 |          0.885 |      0.883 | -       |
+| Hybrid Method *              |                  0.908 |               0.828 |                      0.899 |                   0.979 |          0.904 |      0.903 | -       |
 
-| Method                                                                             | Precision  | Recall     | F1**       | Time   |
-| ---------------------------------------------------------------------------------- | ---------- | ---------- | ---------- | ------ |
-| MinHash (Spark)                                                                    | **0.9570** | **0.9445** | **0.9507** | 18.62s |
-| MinHash                                                                            | **0.9594** | **0.945**  | **0.9519** | 18s    |
-| SimHash\*                                                                          | 0.9007     | 0.6786     | 0.7740     | 210s   |
-| SimHash[(Gyawali et al., LREC 2020)](https://aclanthology.org/2020.lrec-1.113)     | 0.697      | 0.247      | 0.3647     | -      |
-| Exact Title (my implementation)                                                    | 0.8302     | 0.5521     | 0.6632     | -      |
-| Exact Title[(Gyawali et al., LREC 2020)](https://aclanthology.org/2020.lrec-1.113) | 0.830      | 0.50       | 0.624      | -      |
+\* [(Gyawali et al., LREC 2020)](https://aclanthology.org/2020.lrec-1.113)
 
-\*Best SimHash result from `benchmarks/hyperparameter.ipynb`.
+\*\* Best SimHash result from `benchmarks/hyperparameter.ipynb`
 
-\*\* F1 on duplicates as positives
+> [!note]
+> Spark implementation has some overhead for small datasets, so I recommend using the script only when you have a large dataset and enough compute resources.
 
 <!-- ## FAQ
 
poetry.lock

Lines changed: 15 additions & 1 deletion
Generated file; diff not rendered by default.

pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -11,7 +11,6 @@ python = "^3.10"
 numpy = ">=1.26.4"
 tqdm = ">=4.64.1"
 datasets = ">=2.17.0"
-rich = ">=12.5.1"
 scipy = ">=1.10.1"
 xxhash = ">=3.0.0"
 pybloom-live = ">=4.0.0"
@@ -27,13 +26,15 @@ psutil = ">=5.9.8"
 fire = "^0.6.0"
 click = "^8.1.7"
 click-option-group = "^0.5.6"
+rich = "^13.7.1"
 
 [tool.poetry.group.dev.dependencies]
 pre-commit = "^3.6.2"
 insegel = "^1.3.1"
 pytest = "^8.0.2"
 coverage = "^7.4.3"
 ruff = "^0.3.2"
+tabulate = "^0.9.0"
 
 [build-system]
 requires = ["poetry-core"]