Vector Store Chunking Strategy
Summary
Currently, when uploading files to vector stores via the vector_stores.files.create() API, the chunking strategy is hardcoded to default values (800 tokens max, 400 token overlap) even when a custom chunking_strategy parameter is provided. This forces users to manually pre-chunk their documents to match these fixed parameters, which is inflexible and prevents optimal use of different embedding models with varying token limits.
Problem
Current Behavior
- Hardcoded defaults: The chunking_strategy parameter in vector_stores.files.create() is currently ignored or not properly passed through to the chunking logic.
- Fixed chunk size: Files are always chunked at 800 tokens with a 400-token overlap, regardless of:
  - The embedding model's token limit (e.g., all-minilm:l6-v2 has a 256-token limit, nomic-embed-text has an 8,192-token limit)
  - User preferences for chunk size
  - Document structure and content type
- Forces manual pre-chunking: Users must manually chunk their documents before upload to work around this limitation, defeating the purpose of the automatic chunking feature.
Code References
API Definition (src/llama_stack/apis/vector_io/vector_io.py:294-314):
@json_schema_type
class VectorStoreChunkingStrategyStaticConfig(BaseModel):
    """Configuration for static chunking strategy.

    :param chunk_overlap_tokens: Number of tokens to overlap between adjacent chunks
    :param max_chunk_size_tokens: Maximum number of tokens per chunk, must be between 100 and 4096
    """

    chunk_overlap_tokens: int = 400
    max_chunk_size_tokens: int = Field(800, ge=100, le=4096)

Implementation (src/llama_stack/providers/utils/memory/openai_vector_store_mixin.py:773-779):
if isinstance(chunking_strategy, VectorStoreChunkingStrategyStatic):
    max_chunk_size_tokens = chunking_strategy.static.max_chunk_size_tokens
    chunk_overlap_tokens = chunking_strategy.static.chunk_overlap_tokens
else:
    # Default values from OpenAI API spec
    max_chunk_size_tokens = 800
    chunk_overlap_tokens = 400

Issue: The chunking_strategy parameter is defined but not properly exposed in the client API call.
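For reference, a minimal sketch of the object shape that branch expects, assuming the wrapper type takes the static config as a keyword (inferred from the attribute access above; the 512/128 values are purely illustrative):
strategy = VectorStoreChunkingStrategyStatic(
    static=VectorStoreChunkingStrategyStaticConfig(
        max_chunk_size_tokens=512,  # illustrative values, not defaults
        chunk_overlap_tokens=128,
    )
)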
Impact
Silent Data Loss with Small Token Limit Models
When using embedding models with token limits smaller than the default 800-token chunk size (e.g., all-minilm:l6-v2 with 256 token limit):
- The embedding model silently truncates chunks to fit its token limit
- Example: With all-minilm:l6-v2, each 800-token chunk gets truncated to 256 tokens, losing 544 tokens (68% of the content); see the arithmetic sketch after this list
- This data loss is silent: no errors or warnings are raised
- Retrieval quality suffers because embeddings only represent a fraction of each chunk's content
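A minimal arithmetic sketch of that loss (illustrative only; the 800-token default and the 256-token limit are the figures quoted in this issue):
chunk_size = 800      # default max_chunk_size_tokens
model_limit = 256     # all-minilm:l6-v2 context limit
lost = chunk_size - model_limit     # 544 tokens dropped per chunk
print(f"{lost} tokens lost per chunk ({lost / chunk_size:.0%} of the content)")  # 68%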
Suboptimal Performance with Large Token Limit Models
When using embedding models with large token limits (e.g., nomic-embed-text with 8,192 token limit):
- The default 800-token chunks work, but prevent optimization for advanced techniques like contextual retrieval
- Example: Contextual retrieval requires smaller chunks (~700 tokens) to leave room for context (~70-100 tokens) while staying within the 800-token window (see the budget sketch after this list)
- Users cannot optimize chunk size for their specific use case or document structure
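A simple token-budget sketch for contextual retrieval under today's fixed window (figures taken from the example above):
window = 800            # fixed default chunk window today
context_budget = 100    # ~70-100 tokens of prepended context per chunk
usable_chunk = window - context_budget   # 700 tokens left for the original text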
User Workarounds Required
Users currently must:
- Manually chunk documents before upload
- Create separate files for each chunk
- Upload each chunk individually
- Manage chunk metadata manually
This defeats the purpose of having automatic chunking in the vector store API; a rough sketch of the workaround is shown below.
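For concreteness, a sketch of that manual workaround (document_text, client, and vector_store_id are assumed to already exist; the whitespace split stands in for a real tokenizer, and the upload assumes the OpenAI-compatible files endpoint):
import io

def split_tokens(text: str, max_tokens: int = 200) -> list[str]:
    # Crude whitespace "tokenizer", purely for illustration; a real workaround
    # would count tokens with the embedding model's tokenizer.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

for i, piece in enumerate(split_tokens(document_text)):
    uploaded = client.files.create(
        file=(f"doc_part_{i}.txt", io.BytesIO(piece.encode("utf-8"))),
        purpose="assistants",
    )
    client.vector_stores.files.create(
        vector_store_id=vector_store_id,
        file_id=uploaded.id,
    )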
Proposed Solution
1. Expose chunking_strategy Parameter in Client API
Current API (doesn't accept chunking_strategy):
client.vector_stores.files.create(
vector_store_id=vector_store_id,
file_id=file_response.id,
# chunking_strategy parameter is not available!
)
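A sketch of what the exposed parameter could look like, mirroring the OpenAI-compatible shape that section 2 below already uses for the per-store default (the 512/128 values are illustrative, and whether this is a top-level argument or routed through extra_body is an open implementation detail):
client.vector_stores.files.create(
    vector_store_id=vector_store_id,
    file_id=file_response.id,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 512,
            "chunk_overlap_tokens": 128,
        },
    },
)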
2. Make Defaults Configurable Per Vector Store
Allow vector stores to define default chunking strategies based on their embedding model:
client.vector_stores.create(
name="my_vector_store",
metadata={"purpose": "model_cards"},
extra_body={
"embedding_model": "ollama/nomic-embed-text:latest",
"embedding_dimension": 768,
"provider_id": "faiss",
"default_chunking_strategy": {
"type": "static",
"static": {
"max_chunk_size_tokens": 700,
"chunk_overlap_tokens": 100
}
}
}
)
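With such a default in place, files added to this store without an explicit chunking_strategy could inherit the store-level 700/100 settings, while a per-file chunking_strategy (section 1) would still override them; that precedence is a suggestion here, not settled behavior.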
3. Auto-detect Optimal Chunk Size Based on Embedding Model
Optionally, automatically set chunk size based on the embedding model's token limit:
EMBEDDING_MODEL_LIMITS = {
"all-minilm:l6-v2": 256,
"nomic-embed-text": 8192,
"text-embedding-ada-002": 8191,
# ... etc
}
def get_optimal_chunk_size(embedding_model: str) -> int:
    """Get optimal chunk size as 80% of model's token limit."""
    limit = EMBEDDING_MODEL_LIMITS.get(embedding_model, 800)
    return int(limit * 0.8)  # Leave 20% buffer
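For example, the helper above would yield:
get_optimal_chunk_size("all-minilm:l6-v2")   # int(256 * 0.8)  -> 204
get_optimal_chunk_size("nomic-embed-text")   # int(8192 * 0.8) -> 6553
get_optimal_chunk_size("unknown-model")      # unknown models fall back to int(800 * 0.8) -> 640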
Benefits
- Flexibility: Users can optimize chunk size for their specific embedding model
- No truncation: Prevents silent data loss from exceeding token limits
- Better retrieval: Allows optimization for contextual retrieval techniques
- Simpler code: Eliminates need for manual pre-chunking workarounds
- OpenAI compatibility: Matches OpenAI's vector store API behavior
Related Issues
- Contextual Retrieval implementation requires custom chunk sizes
- Embedding model token limit mismatches cause silent truncation
- Users need to manually pre-chunk documents as workaround