From 8b7190bbc718796caddf8afdec228eba6e07ed5f Mon Sep 17 00:00:00 2001
From: jonfritz <134336691+jonfritz@users.noreply.github.com>
Date: Tue, 2 Apr 2024 00:28:10 -0700
Subject: [PATCH] Update dedup.md (#334)

Changed some language and layout of the section.
---
 docs/source/querying_data/dedup.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/source/querying_data/dedup.md b/docs/source/querying_data/dedup.md
index c2485a07e..408f65fcd 100644
--- a/docs/source/querying_data/dedup.md
+++ b/docs/source/querying_data/dedup.md
@@ -1,6 +1,8 @@
 # De-Duplicating Query Results
 
-As mentioned in [Remote Search Processors](remote_processors.md), the Sycamore near-duplicate detection (NDD) facility can be used to drop duplicates from query results. This is implemented as a remote search response processor called `dedup-response`. It's configured in `pipelines.yml` like so:
+The Sycamore near-duplicate detection (NDD) feature can be used to drop duplicates from query results. It is implemented as a [Remote Search Processor](remote_processors.md) called `dedup-response`, and you can add this to your hybrid search or RAG search pipelines. A prerequisite for NDD is to have previously ingested the documents using the `Sketcher` Sycamore transform. See documentation for [sketch](../data_ingestion_and_preparation/transforms/sketch.md) in `DocSet` for details. Sycamore's default data ingestion and search pipelines have sketching and NDD enabled.
+
+NDD is configured in `pipelines.yml` with several preset values:
 
 ```yaml
 - dedup00:
@@ -41,13 +43,13 @@ As mentioned in [Remote Search Processors](remote_processors.md), the Sycamore n
 threshold: 0.55
 ```
 
-As can be seen, there's one parameter, `threshold`, which controls how aggressively NDD will drop documents. Near 0.0, few documents will be removed and they will need to be practically identical to higher-scoring documents. Above 1.0, all documents will be removed, except for the first one. 
+As can be seen, there's one parameter, `threshold`, which controls how aggressively NDD will drop documents. Near 0.0, few documents will be removed and they will need to be practically identical to higher-scoring documents. Above 1.0, all documents will be removed, except for the first one.
 
-The current implementation of NDD uses "shingles" which consist of 16 hash values. The distance between two documents is the number of hash values that differ between the two documents' shingles. The raw number is between 0 and 16, but we normalize it to between 0.0 and 1.0. The logic is basically: `if distance < threshold, drop`. Our default preset `dedup02` allows two hashes to differ. That would make the threshold 2 / 16, or 0.125, but we need to add a smidge because it's a less-than comparison. So, we round to 0.15.
+The current implementation of NDD uses "shingles" which consist of 16 hash values. The distance between two documents is the number of hash values that differ between the two documents' shingles. The raw number is between 0 and 16, but we normalize it to between 0.0 and 1.0. The logic is basically: `if distance < threshold, drop the result`.
 
-A prerequisite for query-time NDD is to have previously ingested the documents using the `Sketcher` Sycamore transform. See documentation for [sketch](../data_ingestion_and_preparation/transforms/sketch.md) in `DocSet` for details.
+Sycamore's default hybrid search and RAG pipelines use `dedup02`, which allows two hashes to differ. That would make the threshold 2 / 16, or 0.125, but we need to set the value slightly higher because it uses a less-than comparison.
 
-In order for the `dedup-response` processor to do its job, it must be able to access the `shingles` field of each retrieved document. This will happen by default if the OpenSearch query does not specify `_source`. Otherwise, it needs to be listed specifically like so:
+The `dedup-response` processor requires the `shingles` field of each retrieved document in the hybrid search step of a pipeline. This will happen by default if the OpenSearch query does not specify `_source`. Otherwise, it needs to be specified directly:
 
 ```
 {
@@ -62,5 +64,3 @@ In order for the `dedup-response` processor to do its job, it must be able to ac
 }
 }
 ```
-
-The practical effect of this is that it's possible to enable or disable NDD by controlling the contents of `_source`.
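
Note on the threshold arithmetic in the patched text: the following is a minimal Python sketch of the `if distance < threshold, drop the result` rule, written only to illustrate why a threshold of exactly 0.125 would not drop a result whose shingles differ in two hashes. It is not Sycamore's implementation. The function names `shingle_distance` and `dedup`, the dict layout of a result, counting differing hashes by set membership, and comparing each result only against results already kept are all assumptions made for the example.

```python
from typing import Dict, List, Sequence

NUM_HASHES = 16  # shingle size stated in the documentation text


def shingle_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Return the normalized distance (0.0 to 1.0) between two shingles.

    Assumption: "hashes that differ" is counted as the hashes the two
    shingles do not share, and each shingle holds 16 distinct hashes.
    """
    shared = len(set(a) & set(b))
    return (NUM_HASHES - shared) / NUM_HASHES


def dedup(results: List[Dict], threshold: float) -> List[Dict]:
    """Drop results that are within `threshold` of an earlier kept result.

    Assumption: `results` is ordered best score first, and each result is
    a dict with a "shingles" list (hypothetical layout).
    """
    kept: List[Dict] = []
    for doc in results:
        is_duplicate = any(
            shingle_distance(doc["shingles"], k["shingles"]) < threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(doc)
    return kept


# Two results whose shingles differ in exactly two of the 16 hashes.
first = {"id": "a", "shingles": list(range(16))}
second = {"id": "b", "shingles": list(range(14)) + [100, 101]}

# Distance is 2 / 16 = 0.125, so a strict less-than against 0.125 keeps both,
# while a slightly higher threshold such as 0.15 drops the second result.
kept_at_0125 = dedup([first, second], threshold=0.125)
kept_at_015 = dedup([first, second], threshold=0.15)
```

Under these assumptions, `kept_at_0125` holds both results and `kept_at_015` holds only the first, which is the reason the text gives for setting the value slightly above 2 / 16.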