fix(server): grow exact prefix-cache snapshot chains #468
Open
dusterbloom wants to merge 7 commits into
What changed
This fixes inline prefix-cache chaining for replay-style agentic conversations:
key_len == snapshot_len
Why
The agentic replay failure mode was repeated re-prefill from stale ancestor snapshots. The cache could stop discovering later boundaries after marker-like content, and even when later boundaries existed, prepare_inline_snap() selected the second-to-last cut. That prevented the snapshot chain from advancing turn by turn.
While validating that chain growth, we found a correctness trap: the inline cache could publish a longer logical key backed by a shorter physical backend snapshot. That hides residual prefill behind a cache hit. This PR makes that impossible by construction: an inline cache hit is only published when the backend snapshot physically materializes the same token position as the cache key.
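The exact-key invariant described above can be illustrated with a minimal Python sketch. It is not the server's actual code: `BackendSnapshot`, `InlinePrefixCache`, `publish`, and `longest_hit` are hypothetical names used only for illustration, and the only thing taken from this PR is the `key_len == snapshot_len` condition.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class BackendSnapshot:
    # Token position that the saved backend state physically covers (hypothetical type).
    snapshot_len: int


class InlinePrefixCache:
    """Toy cache keyed by exact token prefixes (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, ...], BackendSnapshot] = {}

    def publish(self, key_tokens: Sequence[int], snap: BackendSnapshot) -> bool:
        # Exact-key guard: refuse to publish unless the snapshot materializes
        # the same token position as the logical key (key_len == snapshot_len).
        key_len = len(key_tokens)
        if snap.snapshot_len != key_len:
            return False
        self._entries[tuple(key_tokens)] = snap
        return True

    def longest_hit(self, prompt: Sequence[int]) -> Optional[BackendSnapshot]:
        # Walk from the longest prefix down so the newest link in the chain wins.
        for n in range(len(prompt), 0, -1):
            snap = self._entries.get(tuple(prompt[:n]))
            if snap is not None:
                return snap
        return None
```

With this guard, a snapshot that was rounded down to position 4 can never be published under a 5-token key, so any hit reports exactly how much prefill is already done.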
The Qwen35 backend change belongs in this PR because, without it, the server-side exact-key guard correctly refuses rounded snapshots and the real replay falls back to cold prefill every turn. The backend now splits only the ubatch that contains the requested inline boundary, saves exactly at that boundary, and then resumes on the same absolute ubatch grid. That keeps the cache invariant exact without shifting the rest of the prefill stream.
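The split-and-resume behaviour can be sketched as a small planning function. This is an illustration under assumptions, not the Qwen35 backend code: `plan_ubatches`, its parameters, and the half-open `(start, end)` range representation are all hypothetical.

```python
from typing import List, Optional, Tuple


def plan_ubatches(
    n_tokens: int, ubatch_size: int, boundary: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) token ranges for a prefill.

    Only the ubatch that strictly contains `boundary` is split in two, so a
    snapshot can be saved exactly at `boundary`. Every other range stays on
    the absolute grid of multiples of `ubatch_size`.
    """
    if ubatch_size <= 0:
        raise ValueError("ubatch_size must be positive")
    ranges: List[Tuple[int, int]] = []
    start = 0
    while start < n_tokens:
        end = min(start + ubatch_size, n_tokens)
        if boundary is not None and start < boundary < end:
            ranges.append((start, boundary))  # ends exactly at the snapshot point
            ranges.append((boundary, end))    # finishes the same grid cell
        else:
            ranges.append((start, end))
        start = end  # always advance to the grid edge, never to the boundary
    return ranges


print(plan_ubatches(10, 4, boundary=6))
# [(0, 4), (4, 6), (6, 8), (8, 10)]
```

Because `start` always advances to the end of the grid cell rather than to the boundary, the ranges after the split are the same ones an unsplit prefill would have used. A boundary that already lies on the grid needs no split at all.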
Validation
Unit:
Result:
20-turn real-session replay:
Result on the final candidate binary 04477072212e30ed...:
Invariant log check:
No refusing inline-snap, restore misalignment, mismatch, or error lines were present in the final replay log.
Scope
This PR fixes the exact inline prefix-cache chain used by Qwen35 dense/all-hot replay paths. It does not try to recover the separate decode-speed optimizations, pFlash/FlowKV, spec decode, or constrained MoE hybrid snapshot path; those should stay in separate performance PRs and be validated by the combined replay/deathmatch matrix.
Supersedes #467, which GitHub auto-closed when the head branch was renamed to match current conventions.