
@liavweiss (Contributor) commented Dec 31, 2025

Fix: Incremental Streaming for Cached Responses + Streaming Response Caching

Problem

When a cached response was returned for a streaming request (stream: true), the router was sending the entire content in a single SSE chunk instead of incremental chunks. This broke the streaming UX because clients expected to receive content incrementally (word-by-word or token-by-token).

Additionally, streaming responses were not being cached, meaning subsequent identical streaming requests would hit the upstream LLM again instead of using the cache.

Issues Fixed

  1. Single chunk instead of incremental streaming - Cached responses were sent as one large chunk
  2. Error response handling - Cached error responses were improperly converted to streaming format
  3. Malformed output - Empty choices arrays in SSE chunks when error responses were cached
  4. Streaming responses not cached - Streaming responses were skipped from caching entirely

Solution

Core Changes

  1. Word-by-word chunking: Split cached content into per-word chunks for incremental streaming (see the sketch after this list)

    • Uses strings.Fields() to split content
    • Preserves spaces between words
    • Creates one SSE chunk per word
  2. Error response detection: Check whether the cached response is an error before parsing it as a completion (also covered in the sketch below)

    • Detects error responses by checking for error/detail fields and absence of choices
    • Returns properly formatted SSE error chunks
  3. Streaming response caching: Accumulate streaming chunks and cache complete response

    • Accumulates SSE chunks in RequestContext during streaming
    • Parses chunks to extract content, metadata, and usage information
    • Reconstructs complete ChatCompletion when [DONE] marker received
    • Caches only on normal completion (safety checks prevent caching incomplete/aborted streams)
    • Uses AddEntry as fallback when AddPendingRequest fails
  4. Improved error handling:

    • No silent chunk skipping (adds error chunk if marshaling fails)
    • Always ensures [DONE] marker is sent
    • Fallback final chunk if marshaling fails
  5. Performance improvements:

    • Uses bytes.Buffer instead of strings.Join() for string building
    • Single time.Now() call instead of multiple calls
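
To make the mechanics concrete, here is a minimal Go sketch of the word-by-word SSE construction and the pre-parse error check. The type and function names (streamChunk, buildSSEFromCache, isCachedErrorResponse) are illustrative stand-ins rather than the PR's actual identifiers; the real implementation lives in src/semantic-router/pkg/utils/http/response.go and uses the project's own response types.

```go
package cacheresp

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// streamChunk mirrors the shape of an OpenAI-style chat.completion.chunk.
type streamChunk struct {
	ID      string        `json:"id"`
	Object  string        `json:"object"`
	Created int64         `json:"created"`
	Model   string        `json:"model"`
	Choices []chunkChoice `json:"choices"`
}

type chunkChoice struct {
	Index        int         `json:"index"`
	Delta        chunkDelta  `json:"delta"`
	FinishReason interface{} `json:"finish_reason"`
}

type chunkDelta struct {
	Content string `json:"content,omitempty"`
}

// isCachedErrorResponse reports whether a cached body looks like an error
// payload rather than a completion: error/detail present, no choices.
func isCachedErrorResponse(body []byte) bool {
	var probe map[string]json.RawMessage
	if err := json.Unmarshal(body, &probe); err != nil {
		return false
	}
	_, hasErr := probe["error"]
	_, hasDetail := probe["detail"]
	_, hasChoices := probe["choices"]
	return (hasErr || hasDetail) && !hasChoices
}

// buildSSEFromCache turns cached content into word-by-word SSE chunks.
func buildSSEFromCache(id, model, content string) []byte {
	var buf bytes.Buffer         // bytes.Buffer avoids repeated string concatenation
	created := time.Now().Unix() // single time.Now() call reused by every chunk

	words := strings.Fields(content) // whitespace-delimited words
	for i, w := range words {
		if i < len(words)-1 {
			w += " " // re-insert the space between words
		}
		chunk := streamChunk{
			ID: id, Object: "chat.completion.chunk", Created: created, Model: model,
			Choices: []chunkChoice{{Index: 0, Delta: chunkDelta{Content: w}}},
		}
		if b, err := json.Marshal(chunk); err == nil {
			fmt.Fprintf(&buf, "data: %s\n\n", b)
		} else {
			// No silent skipping: emit an error chunk instead of dropping the word.
			fmt.Fprintf(&buf, "data: {\"error\":{\"message\":%q}}\n\n", err.Error())
		}
	}

	// Final chunk carries finish_reason; the [DONE] marker is always sent.
	final := streamChunk{
		ID: id, Object: "chat.completion.chunk", Created: created, Model: model,
		Choices: []chunkChoice{{Index: 0, FinishReason: "stop"}},
	}
	if b, err := json.Marshal(final); err == nil {
		fmt.Fprintf(&buf, "data: %s\n\n", b)
	}
	buf.WriteString("data: [DONE]\n\n")
	return buf.Bytes()
}
```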

Streaming Cache Safety

The streaming cache implementation includes multiple safety checks to ensure that only complete, valid responses are cached (a minimal sketch follows the list):

  1. Normal completion check: Only caches when [DONE] marker is received
  2. Abort detection: Skips caching if stream was aborted (EOF, cancellation, timeout)
  3. Content validation: Ensures accumulated content is not empty
  4. Metadata validation: Verifies required fields (id, model) are present
  5. Reconstruction validation: Validates reconstructed response structure before caching
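
A minimal sketch of these gates, using assumed names (streamState, SawDone, Aborted are illustrative; the real accumulation state lives on the RequestContext in processor_res_body.go):

```go
package cacheresp

import (
	"errors"
	"strings"
)

// streamState is a stand-in for the per-request accumulation kept on the
// RequestContext while SSE chunks are parsed.
type streamState struct {
	ID, Model string          // metadata extracted from the parsed chunks
	Content   strings.Builder // concatenated delta content
	SawDone   bool            // [DONE] marker observed
	Aborted   bool            // EOF, cancellation, or timeout mid-stream
}

// shouldCache applies the safety checks before the accumulated stream is
// reconstructed into a complete ChatCompletion and handed to the cache.
func shouldCache(s *streamState) error {
	switch {
	case !s.SawDone:
		return errors.New("stream did not complete normally")
	case s.Aborted:
		return errors.New("stream was aborted")
	case strings.TrimSpace(s.Content.String()) == "":
		return errors.New("accumulated content is empty")
	case s.ID == "" || s.Model == "":
		return errors.New("missing required metadata (id, model)")
	}
	return nil // safe to reconstruct, validate, and cache
}
```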

Testing

Added comprehensive test coverage:

  • TestCreateCacheHitResponse_Streaming - Updated to verify multiple chunks
  • TestCreateCacheHitResponse_StreamingWithErrorResponse - Error response handling
  • TestCreateCacheHitResponse_StreamingWithEmptyContent - Edge case
  • TestCreateCacheHitResponse_StreamingWithEmptyChoices - Edge case
  • TestCreateCacheHitResponse_StreamingWithWhitespaceContent - Edge case
  • TestCreateCacheHitResponse_StreamingWithLongContent - Multiple chunks verification
  • TestSplitContentIntoChunks - Direct unit tests for chunking function
  • TestParseStreamingChunk - Unit tests for streaming chunk parsing

All tests pass ✅

Design Decision: Word-by-Word vs Tokenizer

I chose word-by-word splitting over tokenizer-based splitting for the following reasons:

Word-by-Word (Chosen)

  • Better UX: Smooth, readable streaming (complete words appear)
  • Simpler: No tokenizer dependency needed
  • Fast: strings.Fields() is very efficient
  • Good enough: Final accumulated result is identical
  • ⚠️ Trade-off: Less faithful to how the model actually generated the text (acceptable for cached responses)

Tokenizer (Alternative)

  • More accurate: Matches exactly how the model generated the text (token-by-token)
  • ⚠️ Worse UX: Can be choppy (subword splits like "Hell" → "o" → " wor" → "ld")
  • ⚠️ More complex: Requires tokenizer dependency
  • ⚠️ Slower: Tokenization adds overhead

Rationale: For cached responses we already have the complete text, so the priority is smooth UX over exact tokenization accuracy. When the upstream LLM streams a live response token-by-token, the router should match that format; for cache hits, word-by-word splitting gives the smoother experience.
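
A tiny, self-contained illustration of the difference (standard library only, example strings are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	content := "Hello world from the cache"
	fmt.Println(strings.Fields(content))
	// Output: [Hello world from the cache]
	// i.e. whole words: "Hello", "world", "from", "the", "cache".
	// A tokenizer-based split might instead emit subword pieces such as
	// "Hell", "o", " wor", "ld", which streams less smoothly to the client.
}
```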

Related Issues

Fixes #913: "bug: Streaming + semantic_cache: cache hit breaks SSE (returns !!!!...) or misses and forwards to upstream"

Checklist

  • Tests added/updated
  • All tests pass
  • Code follows project style guidelines
  • Error handling improved
  • Performance optimizations applied
  • Documentation updated (code comments)
  • Streaming responses now cached safely

@netlify (netlify bot) commented Dec 31, 2025

Deploy Preview for vllm-semantic-router ready!

  • 🔨 Latest commit: 664e81c
  • 🔍 Latest deploy log: https://app.netlify.com/projects/vllm-semantic-router/deploys/695507bb48042900082174f0
  • 😎 Deploy Preview: https://deploy-preview-937--vllm-semantic-router.netlify.app

@github-actions

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 src

Owners: @rootfs, @Xunzhuo, @wangchen615
Files changed:

  • src/semantic-router/pkg/extproc/processor_core.go
  • src/semantic-router/pkg/extproc/processor_req_header.go
  • src/semantic-router/pkg/extproc/processor_res_body.go
  • src/semantic-router/pkg/extproc/processor_res_body_streaming_test.go
  • src/semantic-router/pkg/utils/http/response.go
  • src/semantic-router/pkg/utils/http/response_test.go


🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.
