Vector Store Chunking Strategy
Summary
Currently, when uploading files to vector stores via the vector_stores.files.create() API, the chunking strategy is hardcoded to default values (800 tokens max, 400 token overlap) even when a custom chunking_strategy parameter is provided. This forces users to manually pre-chunk their documents to match these fixed parameters, which is inflexible and prevents optimal use of different embedding models with varying token limits.
Problem
Current Behavior
- Hardcoded defaults: The chunking_strategy parameter in vector_stores.files.create() is currently ignored or not properly passed through to the chunking logic.
- Fixed chunk size: Files are always chunked at 800 tokens with a 400-token overlap, regardless of:
  - The embedding model's token limit (e.g., all-minilm:l6-v2 has a 256-token limit, nomic-embed-text has an 8,192-token limit)
  - User preferences for chunk size
  - Document structure and content type
- Forces manual pre-chunking: Users must manually chunk their documents before upload to work around this limitation, defeating the purpose of the automatic chunking feature.
Code References
API Definition (src/llama_stack/apis/vector_io/vector_io.py:294-314):
@json_schema_type
class VectorStoreChunkingStrategyStaticConfig(BaseModel):
    """Configuration for static chunking strategy.

    :param chunk_overlap_tokens: Number of tokens to overlap between adjacent chunks
    :param max_chunk_size_tokens: Maximum number of tokens per chunk, must be between 100 and 4096
    """

    chunk_overlap_tokens: int = 400
    max_chunk_size_tokens: int = Field(800, ge=100, le=4096)

Implementation (src/llama_stack/providers/utils/memory/openai_vector_store_mixin.py:773-779):
if isinstance(chunking_strategy, VectorStoreChunkingStrategyStatic):
    max_chunk_size_tokens = chunking_strategy.static.max_chunk_size_tokens
    chunk_overlap_tokens = chunking_strategy.static.chunk_overlap_tokens
else:
    # Default values from OpenAI API spec
    max_chunk_size_tokens = 800
    chunk_overlap_tokens = 400

Issue: The chunking_strategy parameter is defined but not properly exposed in the client API call.
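For reference, a minimal sketch of the object shape that branch expects, assuming the wrapper type takes the static config as a keyword (inferred from the attribute access above; the 512/128 values are purely illustrative):
strategy = VectorStoreChunkingStrategyStatic(
    static=VectorStoreChunkingStrategyStaticConfig(
        max_chunk_size_tokens=512,  # illustrative values, not defaults
        chunk_overlap_tokens=128,
    )
)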
Impact
Silent Data Loss with Small Token Limit Models
When using embedding models with token limits smaller than the default 800-token chunk size (e.g., all-minilm:l6-v2 with 256 token limit):
- The embedding model silently truncates chunks to fit its token limit
- Example: With all-minilm:l6-v2, each 800-token chunk gets truncated to 256 tokens, losing 544 tokens (68% of the content); see the arithmetic sketch after this list
- This data loss is silent: no errors or warnings are raised
- Retrieval quality suffers because embeddings only represent a fraction of each chunk's content
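A minimal arithmetic sketch of that loss (illustrative only; the 800-token default and the 256-token limit are the figures quoted in this issue):
chunk_size = 800      # default max_chunk_size_tokens
model_limit = 256     # all-minilm:l6-v2 context limit
lost = chunk_size - model_limit     # 544 tokens dropped per chunk
print(f"{lost} tokens lost per chunk ({lost / chunk_size:.0%} of the content)")  # 68%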
Suboptimal Performance with Large Token Limit Models
When using embedding models with large token limits (e.g., nomic-embed-text with 8,192 token limit):
- The default 800-token chunks work, but prevent optimization for advanced techniques like contextual retrieval
- Example: Contextual retrieval requires smaller chunks (~700 tokens) to leave room for context (~70-100 tokens) while staying within the 800-token window (see the budget sketch after this list)
- Users cannot optimize chunk size for their specific use case or document structure
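A simple token-budget sketch for contextual retrieval under today's fixed window (figures taken from the example above):
window = 800            # fixed default chunk window today
context_budget = 100    # ~70-100 tokens of prepended context per chunk
usable_chunk = window - context_budget   # 700 tokens left for the original text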
User Workarounds Required
Users currently must:
- Manually chunk documents before upload
- Create separate files for each chunk
- Upload each chunk individually
- Manage chunk metadata manually
This defeats the purpose of having automatic chunking in the vector store API; a rough sketch of the workaround is shown below.
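For concreteness, a sketch of that manual workaround (document_text, client, and vector_store_id are assumed to already exist; the whitespace split stands in for a real tokenizer, and the upload assumes the OpenAI-compatible files endpoint):
import io

def split_tokens(text: str, max_tokens: int = 200) -> list[str]:
    # Crude whitespace "tokenizer", purely for illustration; a real workaround
    # would count tokens with the embedding model's tokenizer.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

for i, piece in enumerate(split_tokens(document_text)):
    uploaded = client.files.create(
        file=(f"doc_part_{i}.txt", io.BytesIO(piece.encode("utf-8"))),
        purpose="assistants",
    )
    client.vector_stores.files.create(
        vector_store_id=vector_store_id,
        file_id=uploaded.id,
    )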
Proposed Solution
1. Expose chunking_strategy Parameter in Client API
Current API (doesn't accept chunking_strategy):
client.vector_stores.files.create(
vector_store_id=vector_store_id,
file_id=file_response.id,
# chunking_strategy parameter is not available!
)
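A sketch of what the exposed parameter could look like, mirroring the OpenAI-compatible shape that section 2 below already uses for the per-store default (the 512/128 values are illustrative, and whether this is a top-level argument or routed through extra_body is an open implementation detail):
client.vector_stores.files.create(
    vector_store_id=vector_store_id,
    file_id=file_response.id,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 512,
            "chunk_overlap_tokens": 128,
        },
    },
)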
2. Make Defaults Configurable Per Vector Store
Allow vector stores to define default chunking strategies based on their embedding model:
client.vector_stores.create(
name="my_vector_store",
metadata={"purpose": "model_cards"},
extra_body={
"embedding_model": "ollama/nomic-embed-text:latest",
"embedding_dimension": 768,
"provider_id": "faiss",
"default_chunking_strategy": {
"type": "static",
"static": {
"max_chunk_size_tokens": 700,
"chunk_overlap_tokens": 100
}
}
}
)
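With such a default in place, files added to this store without an explicit chunking_strategy could inherit the store-level 700/100 settings, while a per-file chunking_strategy (section 1) would still override them; that precedence is a suggestion here, not settled behavior.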
3. Auto-detect Optimal Chunk Size Based on Embedding Model
Optionally, automatically set chunk size based on the embedding model's token limit:
EMBEDDING_MODEL_LIMITS = {
"all-minilm:l6-v2": 256,
"nomic-embed-text": 8192,
"text-embedding-ada-002": 8191,
# ... etc
}
def get_optimal_chunk_size(embedding_model: str) -> int:
    """Get optimal chunk size as 80% of model's token limit."""
    limit = EMBEDDING_MODEL_LIMITS.get(embedding_model, 800)
    return int(limit * 0.8)  # Leave 20% buffer
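For example, the helper above would yield:
get_optimal_chunk_size("all-minilm:l6-v2")   # int(256 * 0.8)  -> 204
get_optimal_chunk_size("nomic-embed-text")   # int(8192 * 0.8) -> 6553
get_optimal_chunk_size("unknown-model")      # unknown models fall back to int(800 * 0.8) -> 640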
Benefits
- Flexibility: Users can optimize chunk size for their specific embedding model
- No truncation: Prevents silent data loss from exceeding token limits
- Better retrieval: Allows optimization for contextual retrieval techniques
- Simpler code: Eliminates need for manual pre-chunking workarounds
- OpenAI compatibility: Matches OpenAI's vector store API behavior
Related Issues
- Contextual Retrieval implementation requires custom chunk sizes
- Embedding model token limit mismatches cause silent truncation
- Users need to manually pre-chunk documents as workaround