
@liavweiss (Contributor) commented Dec 31, 2025

Fix: Incremental Streaming for Cached Responses + Streaming Response Caching

Problem

When a cached response was returned for a streaming request (stream: true), the router was sending the entire content in a single SSE chunk instead of incremental chunks. This broke the streaming UX because clients expected to receive content incrementally (word-by-word or token-by-token).

Additionally, streaming responses were not being cached, meaning subsequent identical streaming requests would hit the upstream LLM again instead of using the cache.

Issues Fixed

  1. Single chunk instead of incremental streaming - Cached responses were sent as one large chunk
  2. Error response handling - Cached error responses were improperly converted to streaming format
  3. Malformed output - Empty choices arrays in SSE chunks when error responses were cached
  4. Streaming responses not cached - Streaming responses were skipped from caching entirely

Solution

Core Changes

  1. Word-by-word chunking: Split cached content into per-word chunks for incremental streaming (see the sketch after this list)

    • Uses strings.Fields() to split content
    • Preserves spaces between words
    • Creates one SSE chunk per word
  2. Error response detection: Check whether the cached response is an error before parsing it as a completion (also covered in the sketch below)

    • Detects error responses by checking for error/detail fields and absence of choices
    • Returns properly formatted SSE error chunks
  3. Streaming response caching: Accumulate streaming chunks and cache complete response

    • Accumulates SSE chunks in RequestContext during streaming
    • Parses chunks to extract content, metadata, and usage information
    • Reconstructs complete ChatCompletion when [DONE] marker received
    • Caches only on normal completion (safety checks prevent caching incomplete/aborted streams)
    • Uses AddEntry as fallback when AddPendingRequest fails
  4. Improved error handling:

    • No silent chunk skipping (adds error chunk if marshaling fails)
    • Always ensures [DONE] marker is sent
    • Fallback final chunk if marshaling fails
  5. Performance improvements:

    • Uses bytes.Buffer instead of strings.Join() for string building
    • Single time.Now() call instead of multiple calls
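
To make the mechanics concrete, here is a minimal Go sketch of the word-by-word SSE construction and the pre-parse error check. The type and function names (streamChunk, buildSSEFromCache, isCachedErrorResponse) are illustrative stand-ins rather than the PR's actual identifiers; the real implementation lives in src/semantic-router/pkg/utils/http/response.go and uses the project's own response types.

```go
package cacheresp

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// streamChunk mirrors the shape of an OpenAI-style chat.completion.chunk.
type streamChunk struct {
	ID      string        `json:"id"`
	Object  string        `json:"object"`
	Created int64         `json:"created"`
	Model   string        `json:"model"`
	Choices []chunkChoice `json:"choices"`
}

type chunkChoice struct {
	Index        int         `json:"index"`
	Delta        chunkDelta  `json:"delta"`
	FinishReason interface{} `json:"finish_reason"`
}

type chunkDelta struct {
	Content string `json:"content,omitempty"`
}

// isCachedErrorResponse reports whether a cached body looks like an error
// payload rather than a completion: error/detail present, no choices.
func isCachedErrorResponse(body []byte) bool {
	var probe map[string]json.RawMessage
	if err := json.Unmarshal(body, &probe); err != nil {
		return false
	}
	_, hasErr := probe["error"]
	_, hasDetail := probe["detail"]
	_, hasChoices := probe["choices"]
	return (hasErr || hasDetail) && !hasChoices
}

// buildSSEFromCache turns cached content into word-by-word SSE chunks.
func buildSSEFromCache(id, model, content string) []byte {
	var buf bytes.Buffer         // bytes.Buffer avoids repeated string concatenation
	created := time.Now().Unix() // single time.Now() call reused by every chunk

	words := strings.Fields(content) // whitespace-delimited words
	for i, w := range words {
		if i < len(words)-1 {
			w += " " // re-insert the space between words
		}
		chunk := streamChunk{
			ID: id, Object: "chat.completion.chunk", Created: created, Model: model,
			Choices: []chunkChoice{{Index: 0, Delta: chunkDelta{Content: w}}},
		}
		if b, err := json.Marshal(chunk); err == nil {
			fmt.Fprintf(&buf, "data: %s\n\n", b)
		} else {
			// No silent skipping: emit an error chunk instead of dropping the word.
			fmt.Fprintf(&buf, "data: {\"error\":{\"message\":%q}}\n\n", err.Error())
		}
	}

	// Final chunk carries finish_reason; the [DONE] marker is always sent.
	final := streamChunk{
		ID: id, Object: "chat.completion.chunk", Created: created, Model: model,
		Choices: []chunkChoice{{Index: 0, FinishReason: "stop"}},
	}
	if b, err := json.Marshal(final); err == nil {
		fmt.Fprintf(&buf, "data: %s\n\n", b)
	}
	buf.WriteString("data: [DONE]\n\n")
	return buf.Bytes()
}
```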

Streaming Cache Safety

The streaming cache implementation includes multiple safety checks to ensure that only complete, valid responses are cached (a minimal sketch follows the list):

  1. Normal completion check: Only caches when [DONE] marker is received
  2. Abort detection: Skips caching if stream was aborted (EOF, cancellation, timeout)
  3. Content validation: Ensures accumulated content is not empty
  4. Metadata validation: Verifies required fields (id, model) are present
  5. Reconstruction validation: Validates reconstructed response structure before caching
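
A minimal sketch of these gates, using assumed names (streamState, SawDone, Aborted are illustrative; the real accumulation state lives on the RequestContext in processor_res_body.go):

```go
package cacheresp

import (
	"errors"
	"strings"
)

// streamState is a stand-in for the per-request accumulation kept on the
// RequestContext while SSE chunks are parsed.
type streamState struct {
	ID, Model string          // metadata extracted from the parsed chunks
	Content   strings.Builder // concatenated delta content
	SawDone   bool            // [DONE] marker observed
	Aborted   bool            // EOF, cancellation, or timeout mid-stream
}

// shouldCache applies the safety checks before the accumulated stream is
// reconstructed into a complete ChatCompletion and handed to the cache.
func shouldCache(s *streamState) error {
	switch {
	case !s.SawDone:
		return errors.New("stream did not complete normally")
	case s.Aborted:
		return errors.New("stream was aborted")
	case strings.TrimSpace(s.Content.String()) == "":
		return errors.New("accumulated content is empty")
	case s.ID == "" || s.Model == "":
		return errors.New("missing required metadata (id, model)")
	}
	return nil // safe to reconstruct, validate, and cache
}
```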

Testing

Added comprehensive test coverage:

  • TestCreateCacheHitResponse_Streaming - Updated to verify multiple chunks
  • TestCreateCacheHitResponse_StreamingWithErrorResponse - Error response handling
  • TestCreateCacheHitResponse_StreamingWithEmptyContent - Edge case
  • TestCreateCacheHitResponse_StreamingWithEmptyChoices - Edge case
  • TestCreateCacheHitResponse_StreamingWithWhitespaceContent - Edge case
  • TestCreateCacheHitResponse_StreamingWithLongContent - Multiple chunks verification
  • TestSplitContentIntoChunks - Direct unit tests for chunking function
  • TestParseStreamingChunk - Unit tests for streaming chunk parsing

All tests pass ✅

Design Decision: Word-by-Word vs Tokenizer

I chose word-by-word splitting over tokenizer-based splitting for the following reasons:

Word-by-Word (Chosen)

  • Better UX: Smooth, readable streaming (complete words appear)
  • Simpler: No tokenizer dependency needed
  • Fast: strings.Fields() is very efficient
  • Good enough: Final accumulated result is identical
  • ⚠️ Trade-off: Less faithful to how the model actually generated the text (acceptable for cached responses)

Tokenizer (Alternative)

  • More accurate: Matches exactly how the model generated the text (token-by-token)
  • ⚠️ Worse UX: Can be choppy (subword splits like "Hell" → "o" → " wor" → "ld")
  • ⚠️ More complex: Requires tokenizer dependency
  • ⚠️ Slower: Tokenization adds overhead

Rationale: For cached responses we already have the complete text, so the priority is smooth UX over exact tokenization accuracy. When the upstream LLM streams a live response token-by-token, the router should match that format; for cache hits, word-by-word splitting gives the smoother experience.
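
A tiny, self-contained illustration of the difference (standard library only, example strings are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	content := "Hello world from the cache"
	fmt.Println(strings.Fields(content))
	// Output: [Hello world from the cache]
	// i.e. whole words: "Hello", "world", "from", "the", "cache".
	// A tokenizer-based split might instead emit subword pieces such as
	// "Hell", "o", " wor", "ld", which streams less smoothly to the client.
}
```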

Related Issues

Fixes #913: "bug: Streaming + semantic_cache: cache hit breaks SSE (returns !!!!...) or misses and forwards to upstream"

Checklist

  • Tests added/updated
  • All tests pass
  • Code follows project style guidelines
  • Error handling improved
  • Performance optimizations applied
  • Documentation updated (code comments)
  • Streaming responses now cached safely

@netlify (netlify bot) commented Dec 31, 2025

Deploy Preview for vllm-semantic-router ready!

  • 🔨 Latest commit: 664e81c
  • 🔍 Latest deploy log: https://app.netlify.com/projects/vllm-semantic-router/deploys/695507bb48042900082174f0
  • 😎 Deploy Preview: https://deploy-preview-937--vllm-semantic-router.netlify.app

@github-actions

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 src

Owners: @rootfs, @Xunzhuo, @wangchen615
Files changed:

  • src/semantic-router/pkg/extproc/processor_core.go
  • src/semantic-router/pkg/extproc/processor_req_header.go
  • src/semantic-router/pkg/extproc/processor_res_body.go
  • src/semantic-router/pkg/extproc/processor_res_body_streaming_test.go
  • src/semantic-router/pkg/utils/http/response.go
  • src/semantic-router/pkg/utils/http/response_test.go


🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.
