ChenghaoMou
diff --git a/Makefile b/Makefile
Lines changed: 3 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 15 additions & 12 deletions
diff --git a/poetry.lock b/poetry.lock
Lines changed: 15 additions & 1 deletion
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 2 additions & 1 deletion
@@ -24,6 +24,9 @@ test: up
 	docker compose exec local poetry run coverage report -m
 	docker compose cp local:/app/cobertura.xml cobertura.xml
 
+benchmark: up
+	docker compose exec local poetry run python tests/test_benchmark.py
+
 spark_test: up
 	docker compose exec local poetry run pytest -vvv -s --doctest-modules tests/test_minhash_spark.py
 
 
@@ -220,22 +220,25 @@ INFO     After                         : 47045
 
 ## Benchmarks
 
-A benchmark of different methods here can be found in `benchmarks/wiki40.ipynb`. A notebook in evaluating MinHash on `pinecone/core-2020-05-10-deduplication` can be found in `benchmarks/pinecone.ipynb`.
+A script that benchmarks some of the algorithms on `pinecone/core-2020-05-10-deduplication` can be found in `tests/test_benchmark.py`:
 
-For quick reference, here are the results:
+| Algorithm                    | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score |   Accuracy | Time    |
+| :--------------------------- | ---------------------: | ------------------: | -------------------------: | ----------------------: | -------------: | ---------: | :------ |
+| MinHash Spark                |                  0.957 |              0.9445 |                     0.9471 |                   0.959 |      **0.952** | **0.9202** | 698.76s |
+| MinHash                      |                 0.9594 |              0.9445 |                     0.9474 |                  0.9616 |     **0.9534** |  **0.924** | 18.80s  |
+| SimHash \*\*                 |                 0.9007 |              0.6786 |                     0.7681 |                  0.9343 |         0.8344 |     0.8137 | 253.94s |
+| Exact Title                  |                 0.8302 |              0.5521 |                     0.7098 |                  0.9065 |           0.77 |     0.7456 | -       |
+| Exact Title Matching *       |                  0.830 |                0.50 |                      0.709 |                   0.992 |          0.757 |      0.746 | -       |
+| Simhash Matching *           |                  0.697 |               0.247 |                      0.598 |                   0.985 |          0.631 |      0.616 | -       |
+| Document Vector Similarity * |                  0.912 |               0.779 |                      0.861 |                   0.986 |          0.885 |      0.883 | -       |
+| Hybrid Method *              |                  0.908 |               0.828 |                      0.899 |                   0.979 |          0.904 |      0.903 | -       |
 
-| Method                                                                             | Precision  | Recall     | F1**         | Time   |
-| ---------------------------------------------------------------------------------- | ---------- | ---------- | ---------- | ------ |
-| MinHash (Spark)                                                                    | **0.9570** | **0.9445** | **0.9507** | 18.62s |
-| MinHash                                                                            | **0.9594** | **0.945**  | **0.9519** | 18s    |
-| SimHash\*                                                                          | 0.9007     | 0.6786     | 0.7740     | 210s   |
-| SimHash[(Gyawali et al., LREC 2020)](https://aclanthology.org/2020.lrec-1.113)     | 0.697      | 0.247      | 0.3647     | -      |
-| Exact Title (my implementation)                                                    | 0.8302     | 0.5521     | 0.6632     | -      |
-| Exact Title[(Gyawali et al., LREC 2020)](https://aclanthology.org/2020.lrec-1.113) | 0.830      | 0.50       | 0.624      | -      |
+\* [(Gyawali et al., LREC 2020)](https://aclanthology.org/2020.lrec-1.113)
 
-\*Best SimHash result from `benchmarks/hyperparameter.ipynb`.
+\*\* Best SimHash result from `benchmarks/hyperparameter.ipynb`
 
-\*\* F1 on duplicates as positives
+> [!note]
+> The Spark implementation has some overhead on small datasets, so I recommend using the Spark script only when you have a large dataset and enough compute resources.
 
 <!-- ## FAQ
 
 
@@ -11,7 +11,6 @@ python = "^3.10"
 numpy = ">=1.26.4"
 tqdm = ">=4.64.1"
 datasets = ">=2.17.0"
-rich = ">=12.5.1"
 scipy = ">=1.10.1"
 xxhash = ">=3.0.0"
 pybloom-live = ">=4.0.0"
@@ -27,13 +26,15 @@ psutil = ">=5.9.8"
 fire = "^0.6.0"
 click = "^8.1.7"
 click-option-group = "^0.5.6"
+rich = "^13.7.1"
 
 [tool.poetry.group.dev.dependencies]
 pre-commit = "^3.6.2"
 insegel = "^1.3.1"
 pytest = "^8.0.2"
 coverage = "^7.4.3"
 ruff = "^0.3.2"
+tabulate = "^0.9.0"
 
 [build-system]
 requires = ["poetry-core"]
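
For readers comparing the columns of the benchmark table in the README hunk, the following is a minimal sketch of how per-class precision and recall, macro F1, and accuracy are conventionally derived from binary duplicate labels. It is not taken from `tests/test_benchmark.py`; the names `class_metrics` and `summarize` and the toy labels are hypothetical, and the actual script may compute its numbers differently.

```python
from typing import Dict, Sequence


def class_metrics(y_true: Sequence[bool], y_pred: Sequence[bool], positive: bool) -> Dict[str, float]:
    """Precision, recall, and F1 when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def summarize(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Hypothetical helper: one row of metrics in the shape of the README table."""
    dup = class_metrics(y_true, y_pred, positive=True)    # duplicates as positives
    non = class_metrics(y_true, y_pred, positive=False)   # non-duplicates as positives
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "precision_dup": dup["precision"],
        "recall_dup": dup["recall"],
        "precision_non_dup": non["precision"],
        "recall_non_dup": non["recall"],
        # Macro F1 is the unweighted mean of the two per-class F1 scores.
        "macro_f1": (dup["f1"] + non["f1"]) / 2,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Toy labels (made up for illustration): True means "is a duplicate".
    truth = [True, True, True, False, False, False, False, False]
    preds = [True, True, False, True, False, False, False, False]
    for name, value in summarize(truth, preds).items():
        print(f"{name:>18}: {value:.4f}")
```

Under this convention, a method can show high precision on duplicates while still having a low macro F1 if its recall on duplicates is poor, which is the pattern the SimHash and Exact Title rows show.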