
[FEATURE] Retrieve specific chunk from chunked documents #1188

Open
yuye-aws opened this issue Feb 16, 2025 · 2 comments

Comments

@yuye-aws
Member

yuye-aws commented Feb 16, 2025

Is your feature request related to a problem?

The text chunking ingestion processor has been available since OpenSearch 2.13. Users can follow the tutorials (pre-trained models and text chunking) to perform text embedding on each chunked segment. However, the search query specified in the text chunking documentation can only return the matched document, not the matched segment. In some use cases, such as with the retrieval_augmented_generation processor (see this issue for details), the retrieved text needs to be as concise as possible so that it does not exceed the context limit of the LLM.

What solution would you like?

The neural-search plugin should allow users to retrieve a specific chunked segment.

What alternatives have you considered?

Before proposing any specific solutions, I would like to introduce a workaround. Thanks to @heemin32 (see this RFC for details), setting the parameter expand_nested_docs to true returns a score for each chunked segment. It's recommended to post-process the results and obtain the most relevant chunk.

// Query
GET testindex/_search
{
  "query": {
    "nested": {
      "inner_hits": {},
      "score_mode": "avg",
      "path": "passage_chunk_embedding",
      "query": {
        "neural": {
          "passage_chunk_embedding.knn": {
            "query_text": "document",
            "model_id": "3WK6DZUBFnFF_ZrLTpry",
            "expand_nested_docs": true
          }
        }
      }
    }
  }
}


// Response with the retrieved inner_hits
{
    "took": 50,
    "timed_out": false,
    "_shards": {
      "total": 1,
      "successful": 1,
      "skipped": 0,
      "failed": 0
    },
    "hits": {
      "total": {
        "value": 1,
        "relation": "eq"
      },
      "max_score": 0.01699731,
      "hits": [
        {
          "_index": "testindex",
          "_id": "4WK9DZUBFnFF_ZrLFZpi",
          "_score": 0.01699731,
          "_source": {
            "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.",
            "passage_chunk": [
              "This is an example document to be chunked. The document ",
              "The document contains a single paragraph, two sentences and 24 ",
              "and 24 tokens by standard tokenizer in OpenSearch."
            ],
            "passage_chunk_embedding": [
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              }
            ]
          },
          "inner_hits": {
            "passage_chunk_embedding": {
              "hits": {
                "total": {
                  "value": 3,
                  "relation": "eq"
                },
                "max_score": 0.022762492,
                "hits": [
                  {
                    "_index": "testindex",
                    "_id": "4WK9DZUBFnFF_ZrLFZpi",
                    "_nested": {
                      "field": "passage_chunk_embedding",
                      "offset": 1
                    },
                    "_score": 0.022762492,
                    "_source": {
                      "knn": [ ... ]
                    }
                  },
                  {
                    "_index": "testindex",
                    "_id": "4WK9DZUBFnFF_ZrLFZpi",
                    "_nested": {
                      "field": "passage_chunk_embedding",
                      "offset": 0
                    },
                    "_score": 0.016283773,
                    "_source": {
                      "knn": [ ... ]
                    }
                  },
                  {
                    "_index": "testindex",
                    "_id": "4WK9DZUBFnFF_ZrLFZpi",
                    "_nested": {
                      "field": "passage_chunk_embedding",
                      "offset": 2
                    },
                    "_score": 0.011945663,
                    "_source": {
                      "knn": [ ... ]
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
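Concretely, the recommended post-processing can be as simple as mapping the top inner hit's nested offset back into the stored passage_chunk array. Below is a minimal Python sketch of that step, assuming the field names from the example above (passage_chunk, passage_chunk_embedding) and a client that returns the search response as a dict; adapt the names to your own index.

```python
def best_chunk(response: dict) -> str:
    """Return the text of the highest-scoring chunked segment.

    Relies on inner_hits being sorted by score (the default), so the
    first nested hit is the best match. Its _nested.offset indexes into
    the parent document's passage_chunk array.
    """
    # Top-level hit: the matched parent document
    hit = response["hits"]["hits"][0]
    # Nested hits for the chunk embeddings, sorted by descending score
    inner = hit["inner_hits"]["passage_chunk_embedding"]["hits"]["hits"]
    offset = inner[0]["_nested"]["offset"]
    return hit["_source"]["passage_chunk"][offset]
```

With the sample response above, the top inner hit has offset 1, so this would return the second chunk ("The document contains a single paragraph, two sentences and 24 ").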

Do you have any additional context?

The following lists some related customer issues.

  1. [FEATURE] Enable to use passage chunks from hybrid neural search result as RAG input
  2. How to get score per chunk so that i can retrieve the most relevant chunk from the document?
  3. [FEATURE] Combine text chunking and text embedding output to a single nested field
@yuye-aws
Member Author

#1177 presents one possible solution to this feature request.

@heemin32
Collaborator

@yuye-aws Thanks for sharing the alternative!
