
Feature Request: Make Vector Store File Upload Chunking Strategy Configurable #4021

@zanetworker

Description

Vector Store Chunking Strategy

Summary

Currently, when uploading files to vector stores via the vector_stores.files.create() API, the chunking strategy is hardcoded to default values (800-token max chunk size, 400-token overlap), even when a custom chunking_strategy parameter is provided. This forces users to manually pre-chunk their documents to match these fixed parameters, which is inflexible and prevents optimal use of embedding models with varying token limits.

Problem

Current Behavior

  1. Hardcoded defaults: The chunking_strategy parameter in vector_stores.files.create() is currently ignored or not properly passed through to the chunking logic.

  2. Fixed chunk size: Files are always chunked at 800 tokens with 400 token overlap, regardless of:

    • The embedding model's token limit (e.g., all-minilm:l6-v2 has a 256-token limit, while nomic-embed-text supports 8,192 tokens)
    • User preferences for chunk size
    • Document structure and content type
  3. Forces manual pre-chunking: Users must manually chunk their documents before upload to work around this limitation, defeating the purpose of the automatic chunking feature.

Code References

API Definition (src/llama_stack/apis/vector_io/vector_io.py:294-314):

# Imports needed to make this excerpt self-contained
# (json_schema_type is assumed to come from llama_stack.schema_utils):
from pydantic import BaseModel, Field

from llama_stack.schema_utils import json_schema_type


@json_schema_type
class VectorStoreChunkingStrategyStaticConfig(BaseModel):
    """Configuration for static chunking strategy.

    :param chunk_overlap_tokens: Number of tokens to overlap between adjacent chunks
    :param max_chunk_size_tokens: Maximum number of tokens per chunk, must be between 100 and 4096
    """

    chunk_overlap_tokens: int = 400
    max_chunk_size_tokens: int = Field(800, ge=100, le=4096)

Implementation (src/llama_stack/providers/utils/memory/openai_vector_store_mixin.py:773-779):

if isinstance(chunking_strategy, VectorStoreChunkingStrategyStatic):
    max_chunk_size_tokens = chunking_strategy.static.max_chunk_size_tokens
    chunk_overlap_tokens = chunking_strategy.static.chunk_overlap_tokens
else:
    # Default values from OpenAI API spec
    max_chunk_size_tokens = 800
    chunk_overlap_tokens = 400

Issue: The chunking_strategy parameter is defined but not properly exposed in the client API call.

Impact

Silent Data Loss with Small Token Limit Models

When using embedding models with token limits smaller than the default 800-token chunk size (e.g., all-minilm:l6-v2 with its 256-token limit):

  • The embedding model silently truncates chunks to fit its token limit
  • Example: With all-minilm:l6-v2, each 800-token chunk gets truncated to 256 tokens, losing 544 tokens (68% of the content)
  • This data loss is silent: no errors or warnings are raised
  • Retrieval quality suffers because embeddings only represent a fraction of each chunk's content
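
To make the arithmetic concrete, here is a minimal illustration of the loss (plain Python, not llama-stack code; the constants mirror the example above):

DEFAULT_CHUNK_SIZE = 800   # current hardcoded chunk size
MODEL_TOKEN_LIMIT = 256    # e.g., all-minilm:l6-v2

tokens_lost = max(DEFAULT_CHUNK_SIZE - MODEL_TOKEN_LIMIT, 0)
loss_ratio = tokens_lost / DEFAULT_CHUNK_SIZE
print(f"{tokens_lost} tokens lost per chunk ({loss_ratio:.0%})")
# -> 544 tokens lost per chunk (68%)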

Suboptimal Performance with Large Token Limit Models

When using embedding models with large token limits (e.g., nomic-embed-text with an 8,192-token limit):

  • The default 800-token chunks work, but prevent optimization for advanced techniques like contextual retrieval
  • Example: Contextual retrieval requires smaller chunks (~700 tokens) to leave room for added context (~70-100 tokens) while staying within the 800-token window
  • Users cannot optimize chunk size for their specific use case or document structure
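
In other words, the token budget under the current ceiling looks like this (illustrative numbers from the example above):

MAX_CHUNK_SIZE = 800    # hardcoded ceiling
CONTEXT_BUDGET = 100    # prepended context per chunk (~70-100 tokens)

usable_chunk_tokens = MAX_CHUNK_SIZE - CONTEXT_BUDGET  # -> 700, with no way to tune it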

User Workarounds Required

Users currently must:

  1. Manually chunk documents before upload
  2. Create separate files for each chunk
  3. Upload each chunk individually
  4. Manage chunk metadata manually

This defeats the purpose of having automatic chunking in the vector store API.
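
For reference, the workaround looks roughly like this (a sketch assuming an OpenAI-compatible client; the pre_chunk helper and its word-based splitting are illustrative stand-ins for real tokenization):

import io

def pre_chunk(text: str, max_tokens: int = 256, overlap: int = 50) -> list[str]:
    """Hypothetical helper: split text into ~max_tokens pieces with overlap."""
    words = text.split()  # crude stand-in for a real tokenizer
    step = max_tokens - overlap
    return [" ".join(words[i : i + max_tokens]) for i in range(0, len(words), step)]

document_text = open("model_card.md").read()  # the document to upload

for i, chunk in enumerate(pre_chunk(document_text)):
    # Steps 1-3: create and upload a separate file per chunk
    f = client.files.create(
        file=(f"model_card_chunk_{i}.txt", io.BytesIO(chunk.encode("utf-8"))),
        purpose="assistants",
    )
    client.vector_stores.files.create(vector_store_id=vector_store_id, file_id=f.id)
    # Step 4: chunk metadata (source document, position) must be tracked by hand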

Proposed Solution

1. Expose chunking_strategy Parameter in Client API

Current API (doesn't accept chunking_strategy):

client.vector_stores.files.create(
    vector_store_id=vector_store_id,
    file_id=file_response.id,
    # chunking_strategy parameter is not available!
)
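
The fix would accept the parameter directly, using the same shape as OpenAI's API (a sketch of the intended call):

client.vector_stores.files.create(
    vector_store_id=vector_store_id,
    file_id=file_response.id,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 256,  # match the embedding model's limit
            "chunk_overlap_tokens": 50,
        },
    },
)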

2. Make Defaults Configurable Per Vector Store

Allow vector stores to define default chunking strategies based on their embedding model:

client.vector_stores.create(
    name="my_vector_store",
    metadata={"purpose": "model_cards"},
    extra_body={
        "embedding_model": "ollama/nomic-embed-text:latest",
        "embedding_dimension": 768,
        "provider_id": "faiss",
        "default_chunking_strategy": {
            "type": "static",
            "static": {
                "max_chunk_size_tokens": 700,
                "chunk_overlap_tokens": 100
            }
        }
    }
)
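
A per-file chunking_strategy would then override the store default, which in turn overrides the global constants. A minimal sketch of that resolution order (the helper name and dict shapes are hypothetical):

def resolve_chunking_params(file_strategy: dict | None, store_default: dict | None) -> tuple[int, int]:
    """Pick the most specific static chunking config available."""
    for strategy in (file_strategy, store_default):
        if strategy and strategy.get("type") == "static":
            static = strategy["static"]
            return static["max_chunk_size_tokens"], static["chunk_overlap_tokens"]
    return 800, 400  # current OpenAI-spec defaults as the last resort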

3. Auto-detect Optimal Chunk Size Based on Embedding Model

Optionally, automatically set chunk size based on the embedding model's token limit:

EMBEDDING_MODEL_LIMITS = {
    "all-minilm:l6-v2": 256,
    "nomic-embed-text": 8192,
    "text-embedding-ada-002": 8191,
    # ... etc
}

def get_optimal_chunk_size(embedding_model: str) -> int:
    """Get optimal chunk size as 80% of model's token limit."""
    limit = EMBEDDING_MODEL_LIMITS.get(embedding_model, 800)
    return int(limit * 0.8)  # Leave 20% buffer
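
Usage at file-attach time could then look like this (the 50% overlap ratio simply mirrors the current 800/400 defaults):

chunk_size = get_optimal_chunk_size("all-minilm:l6-v2")  # int(256 * 0.8) -> 204
chunk_overlap = chunk_size // 2                          # -> 102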

Benefits

  1. Flexibility: Users can optimize chunk size for their specific embedding model
  2. No truncation: Prevents silent data loss from exceeding token limits
  3. Better retrieval: Allows optimization for contextual retrieval techniques
  4. Simpler code: Eliminates need for manual pre-chunking workarounds
  5. OpenAI compatibility: Matches OpenAI's vector store API behavior

Related Issues

  • Contextual Retrieval implementation requires custom chunk sizes
  • Embedding model token limit mismatches cause silent truncation
  • Users need to manually pre-chunk documents as workaround
