fix: streaming cache incremental chunks for cache hits + cache streaming responses #937

## Fix: Incremental Streaming for Cached Responses + Streaming Response Caching
### Problem

When a cached response was returned for a streaming request (`stream: true`), the router was sending the entire content in a single SSE chunk instead of incremental chunks. This broke the streaming UX because clients expected to receive content incrementally (word-by-word or token-by-token).

Additionally, streaming responses were not being cached, meaning subsequent identical streaming requests would hit the upstream LLM again instead of using the cache.
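For context, clients of an OpenAI-compatible streaming endpoint expect a sequence of `chat.completion.chunk` SSE frames followed by a `[DONE]` marker, roughly like the abridged example below, rather than a single frame carrying the whole message:

```text
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```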
### Issues Fixed

- Cached responses for streaming requests were returned as a single SSE chunk instead of incremental chunks
- Streaming responses were not cached, so identical streaming requests always went back to the upstream LLM
- Empty `choices` arrays in SSE chunks when error responses were cached

### Solution
#### Core Changes

- **Word-by-word chunking**: Split cached content into word-by-word chunks for incremental streaming
  - Uses `strings.Fields()` to split content
- **Error response detection**: Check if the cached response is an error BEFORE parsing (see the sketch after this list)
  - Detects `error`/`detail` fields and absence of `choices`
- **Streaming response caching**: Accumulate streaming chunks and cache the complete response
  - Chunks are accumulated in the `RequestContext` during streaming
  - The complete `ChatCompletion` is cached when the `[DONE]` marker is received
  - `AddEntry` is used as a fallback when `AddPendingRequest` fails
- **Improved error handling**:
  - The `[DONE]` marker is sent
- **Performance improvements**:
  - Uses `bytes.Buffer` instead of `strings.Join()` for string building
  - A single `time.Now()` call instead of multiple calls
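A minimal sketch of the error-detection step referenced above, assuming the cached entry is the raw response body; the function and field names here are illustrative, not necessarily the ones used in the patch:

```go
package cache // illustrative package name

import "encoding/json"

// isCachedErrorResponse reports whether a cached body looks like an upstream
// error rather than a valid chat completion: it carries an "error" or
// "detail" field, or has no "choices" to replay.
func isCachedErrorResponse(cached []byte) bool {
	var probe struct {
		Error   json.RawMessage   `json:"error"`
		Detail  json.RawMessage   `json:"detail"`
		Choices []json.RawMessage `json:"choices"`
	}
	if err := json.Unmarshal(cached, &probe); err != nil {
		// Unparseable entries are treated as errors and never replayed as a stream.
		return true
	}
	return probe.Error != nil || probe.Detail != nil || len(probe.Choices) == 0
}
```

Running a check like this before building SSE chunks is what avoids emitting frames with empty `choices` for cached error responses.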
#### Streaming Cache Safety
The streaming cache implementation includes multiple safety checks to ensure only complete, valid responses are cached:

- A response is only cached once the `[DONE]` marker is received
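A minimal sketch of that accumulate-then-cache flow; the real implementation keeps this state on the `RequestContext` and goes through the router's cache API (`AddPendingRequest` with `AddEntry` as a fallback), which is only stubbed out here as a `store` callback:

```go
package cache // illustrative package name

import "bytes"

// streamAccumulator is a stand-in for the per-request state kept while
// upstream SSE chunks pass through the router.
type streamAccumulator struct {
	content bytes.Buffer // delta content gathered from each chunk
	done    bool         // set once the [DONE] marker has been seen
}

// OnChunk appends the delta content parsed from one upstream chunk.
func (a *streamAccumulator) OnChunk(deltaContent string) {
	a.content.WriteString(deltaContent)
}

// OnDone marks the stream as cleanly finished when "data: [DONE]" arrives.
func (a *streamAccumulator) OnDone() { a.done = true }

// MaybeCache stores the assembled response, but only for streams that
// finished cleanly with non-empty content; partial or failed streams are
// never cached, and a cache write failure never affects the client stream.
func (a *streamAccumulator) MaybeCache(store func(full string) error) {
	if !a.done || a.content.Len() == 0 {
		return
	}
	_ = store(a.content.String())
}
```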
### Testing

Added comprehensive test coverage:
- `TestCreateCacheHitResponse_Streaming` - Updated to verify multiple chunks
- `TestCreateCacheHitResponse_StreamingWithErrorResponse` - Error response handling
- `TestCreateCacheHitResponse_StreamingWithEmptyContent` - Edge case
- `TestCreateCacheHitResponse_StreamingWithEmptyChoices` - Edge case
- `TestCreateCacheHitResponse_StreamingWithWhitespaceContent` - Edge case
- `TestCreateCacheHitResponse_StreamingWithLongContent` - Multiple chunks verification
- `TestSplitContentIntoChunks` - Direct unit tests for the chunking function
- `TestParseStreamingChunk` - Unit tests for streaming chunk parsing

All tests pass ✅
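As an illustration only (not the literal test code), `TestSplitContentIntoChunks` could be structured as a table-driven test along these lines, assuming the `splitContentIntoChunks` helper sketched under the Design Decision section below:

```go
package cache // illustrative package name

import "testing"

// Sketch: empty and whitespace-only inputs yield no chunks; multi-word
// content yields one chunk per word.
func TestSplitContentIntoChunks(t *testing.T) {
	cases := []struct {
		name    string
		content string
		want    int // expected number of chunks
	}{
		{"empty", "", 0},
		{"whitespace only", " \n\t ", 0},
		{"single word", "hello", 1},
		{"multiple words", "cached responses stream incrementally", 4},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := splitContentIntoChunks(tc.content); len(got) != tc.want {
				t.Fatalf("splitContentIntoChunks(%q) returned %d chunks, want %d", tc.content, len(got), tc.want)
			}
		})
	}
}
```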
### Design Decision: Word-by-Word vs Tokenizer

I chose word-by-word splitting over tokenizer-based splitting for the following reasons:

#### Word-by-Word (Chosen)

- `strings.Fields()` is very efficient

#### Tokenizer (Alternative)

**Rationale**: For cached responses, we already have the complete text. The priority is smooth UX over exact tokenization accuracy. If the upstream LLM sends token-by-token, we should match that format, but for cached responses, word-by-word provides a better user experience.
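A minimal sketch of the word-by-word approach, with a hypothetical signature (the actual helper may group several words per chunk or carry extra metadata):

```go
package cache // illustrative package name

import "strings"

// splitContentIntoChunks splits cached completion text into word-sized pieces
// for incremental SSE delivery. strings.Fields normalizes runs of whitespace,
// so replayed content is re-joined with single spaces; empty or
// whitespace-only content yields no chunks.
func splitContentIntoChunks(content string) []string {
	words := strings.Fields(content)
	if len(words) == 0 {
		return nil
	}
	chunks := make([]string, 0, len(words))
	for i, word := range words {
		if i < len(words)-1 {
			word += " " // keep a separator so the client sees natural spacing
		}
		chunks = append(chunks, word)
	}
	return chunks
}
```

Each returned chunk can then be wrapped in its own `chat.completion.chunk` SSE frame, followed by a final chunk with `finish_reason: "stop"` and the `[DONE]` marker.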
### Related Issues
Fixes #913
### Checklist