README.md (15 additions & 12 deletions)
@@ -220,22 +220,25 @@ INFO After : 47045
## Benchmarks
-A benchmark of different methods here can be found in `benchmarks/wiki40.ipynb`. A notebook in evaluating MinHash on `pinecone/core-2020-05-10-deduplication` can be found in `benchmarks/pinecone.ipynb`.
+A script to benchmark some of the algorithms on `pinecone/core-2020-05-10-deduplication` is provided in `tests/test_benchmark.py`:
\*[(Gyawali et al., LREC 2020)](https://aclanthology.org/2020.lrec-1.113)
-\*Best SimHash result from `benchmarks/hyperparameter.ipynb`.
+\*\*Best SimHash result from `benchmarks/hyperparameter.ipynb`
-\*\* F1 on duplicates as positives
+> [!note]
+> Spark implementation has some overhead for small datasets, so I recommend using the script only when you have a large dataset and enough compute resources.
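
One footnote touched by this diff describes the reported metric as "F1 on duplicates as positives". As a rough illustration of what that phrase usually means, here is a minimal sketch in plain Python. It is not taken from `tests/test_benchmark.py`; the function name `duplicate_f1` and the sample labels are hypothetical, and the actual benchmark may compute the score differently.

```python
def duplicate_f1(y_true, y_pred):
    """F1 score where True (document is a duplicate) is the positive class.

    Hypothetical helper for illustration only, not part of the repository.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # duplicates correctly flagged
    fp = sum(1 for t, p in pairs if not t and p)    # non-duplicates wrongly flagged
    fn = sum(1 for t, p in pairs if t and not p)    # duplicates that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels: True marks a document as a duplicate.
truth = [True, True, False, False, True]
preds = [True, False, False, True, True]
print(f"F1 (duplicates as positives): {duplicate_f1(truth, preds):.3f}")
```

Treating duplicates as the positive class means a method is rewarded for finding duplicates and penalized for flagging unique documents, which is typically the more informative view when duplicates are the minority.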