
[FEATURE] Retrieve specific chunk from chunked documents #1188

Open
yuye-aws opened this issue Feb 16, 2025 · 2 comments

Comments

@yuye-aws
Member

yuye-aws commented Feb 16, 2025

Is your feature request related to a problem?

The text chunking ingestion processor has been available since OpenSearch 2.13. Users can follow the tutorials (pre-trained models and text chunking) to perform text embedding on each chunked segment. However, the search query specified in the text chunking documentation can only return the matched document, not the matched segment. In some use cases, such as with the retrieval_augmented_generation processor (see this issue for details), the retrieved text needs to be as concise as possible so that it does not exceed the context limit of the LLM.

What solution would you like?

The neural-search plugin should allow users to retrieve a specific chunked segment.

What alternatives have you considered?

Before proposing any specific solutions, I would like to introduce a workaround. Thanks to @heemin32 (see this RFC for details), setting the parameter expand_nested_docs to true returns a score for each chunked segment. It's recommended to post-process the results and obtain the most relevant chunk.

// Query
GET testindex/_search
{
  "query": {
    "nested": {
      "inner_hits": {},
      "score_mode": "avg",
      "path": "passage_chunk_embedding",
      "query": {
        "neural": {
          "passage_chunk_embedding.knn": {
            "query_text": "document",
            "model_id": "3WK6DZUBFnFF_ZrLTpry",
            "expand_nested_docs": true
          }
        }
      }
    }
  }
}


// Response with the retrieved inner_hits
{
    "took": 50,
    "timed_out": false,
    "_shards": {
      "total": 1,
      "successful": 1,
      "skipped": 0,
      "failed": 0
    },
    "hits": {
      "total": {
        "value": 1,
        "relation": "eq"
      },
      "max_score": 0.01699731,
      "hits": [
        {
          "_index": "testindex",
          "_id": "4WK9DZUBFnFF_ZrLFZpi",
          "_score": 0.01699731,
          "_source": {
            "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.",
            "passage_chunk": [
              "This is an example document to be chunked. The document ",
              "The document contains a single paragraph, two sentences and 24 ",
              "and 24 tokens by standard tokenizer in OpenSearch."
            ],
            "passage_chunk_embedding": [
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              }
            ]
          },
          "inner_hits": {
            "passage_chunk_embedding": {
              "hits": {
                "total": {
                  "value": 3,
                  "relation": "eq"
                },
                "max_score": 0.022762492,
                "hits": [
                  {
                    "_index": "testindex",
                    "_id": "4WK9DZUBFnFF_ZrLFZpi",
                    "_nested": {
                      "field": "passage_chunk_embedding",
                      "offset": 1
                    },
                    "_score": 0.022762492,
                    "_source": {
                      "knn": [ ... ]
                    }
                  },
                  {
                    "_index": "testindex",
                    "_id": "4WK9DZUBFnFF_ZrLFZpi",
                    "_nested": {
                      "field": "passage_chunk_embedding",
                      "offset": 0
                    },
                    "_score": 0.016283773,
                    "_source": {
                      "knn": [ ... ]
                    }
                  },
                  {
                    "_index": "testindex",
                    "_id": "4WK9DZUBFnFF_ZrLFZpi",
                    "_nested": {
                      "field": "passage_chunk_embedding",
                      "offset": 2
                    },
                    "_score": 0.011945663,
                    "_source": {
                      "knn": [ ... ]
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
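Concretely, the recommended post-processing can be as simple as mapping the top inner hit's nested offset back into the stored passage_chunk array. Below is a minimal Python sketch of that step, assuming the field names from the example above (passage_chunk, passage_chunk_embedding) and a client that returns the search response as a dict; adapt the names to your own index.

```python
def best_chunk(response: dict) -> str:
    """Return the text of the highest-scoring chunked segment.

    Relies on inner_hits being sorted by score (the default), so the
    first nested hit is the best match. Its _nested.offset indexes into
    the parent document's passage_chunk array.
    """
    # Top-level hit: the matched parent document
    hit = response["hits"]["hits"][0]
    # Nested hits for the chunk embeddings, sorted by descending score
    inner = hit["inner_hits"]["passage_chunk_embedding"]["hits"]["hits"]
    offset = inner[0]["_nested"]["offset"]
    return hit["_source"]["passage_chunk"][offset]
```

With the sample response above, the top inner hit has offset 1, so this would return the second chunk ("The document contains a single paragraph, two sentences and 24 ").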

Do you have any additional context?

The following lists some related customer issues.

  1. [FEATURE] Enable to use passage chunks from hybrid neural search result as RAG input
  2. How to get score per chunk so that i can retrieve the most relevant chunk from the document?
  3. [FEATURE] Combine text chunking and text embedding output to a single nested field
@yuye-aws
Member Author

#1177 presents one possible solution to this feature request.

@heemin32
Collaborator

@yuye-aws Thanks for sharing the alternative!
