fix(server): grow exact prefix-cache snapshot chains by dusterbloom · Pull Request #468 · Luce-Org/lucebox-hub

dusterbloom · 2026-06-30T14:33:43Z

What changed

This fixes inline prefix-cache chaining for replay-style agentic conversations:

- grow the in-memory snapshot chain at the newest valid boundary instead of reusing stale ancestors
- track the backend snapshot length separately from the logical cache key during publication
- require published inline entries to satisfy key_len == snapshot_len
- reject/skip shorter physical snapshots instead of publishing longer logical aliases
- make Qwen35 materialize inline snapshots exactly at the requested chat boundary, preserving the absolute ubatch phase after restore
- add model-free unit coverage for boundary recovery, exact-prefix chain growth, alias rejection, and mismatched snapshot rejection

Why

The agentic replay failure mode was repeated re-prefill from stale ancestor snapshots. The cache could stop discovering later boundaries after marker-like content, and even when later boundaries existed, prepare_inline_snap() selected the second-to-last cut. As a result, the snapshot chain could not advance turn by turn.
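
To make the boundary selection concrete, here is a minimal sketch of picking the newest usable cut. It is not the PR's prepare_inline_snap(); the function name pick_inline_cut, its parameters, and the rule that a cut is "valid" when it lies strictly past the cached snapshot and within the prompt are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not the PR's prepare_inline_snap()).
// boundaries: chat-boundary token positions found in the prompt, ascending.
// cached_len: length of the snapshot already cached for this conversation.
// prompt_len: total number of tokens in the current prompt.
std::optional<std::size_t> pick_inline_cut(const std::vector<std::size_t>& boundaries,
                                           std::size_t cached_len,
                                           std::size_t prompt_len) {
    // Walk from the newest boundary backwards. The first boundary that is
    // strictly past the cached snapshot and inside the prompt is the newest
    // valid cut, so the chain grows by one link per turn.
    for (auto it = boundaries.rbegin(); it != boundaries.rend(); ++it) {
        if (*it > cached_len && *it <= prompt_len) {
            return *it;
        }
    }
    return std::nullopt;  // nothing newer than the cached snapshot
}
```

Picking the second-to-last boundary instead would keep returning a position at or behind the cached snapshot once only one new boundary exists per turn, which matches the stalled-chain symptom described above.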

While validating that chain growth, we found a correctness trap: the inline cache could publish a longer logical key backed by a shorter physical backend snapshot. That hides residual prefill behind a cache hit. This PR makes that impossible by construction: an inline cache hit is only published when the backend snapshot physically materializes the same token position as the cache key.
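
The publication guard can be pictured roughly as follows. This is a sketch under assumptions: InlineEntry, InlineIndex, and publish() are hypothetical names, and the real cache in server/src/server/prefix_cache.cpp keys on more than a length. Only the key_len == snapshot_len condition comes from this PR.

```cpp
#include <cstddef>
#include <map>

// Hypothetical record for one published inline snapshot.
struct InlineEntry {
    int slot;
    std::size_t key_len;       // logical cache-key length in tokens
    std::size_t snapshot_len;  // tokens physically materialized by the backend
};

// Hypothetical index, keyed by length only to keep the sketch short.
class InlineIndex {
public:
    // Publishes an entry only when the backend snapshot covers exactly the
    // same token position as the logical key. A shorter physical snapshot is
    // refused rather than published under a longer logical alias.
    bool publish(int slot, std::size_t key_len, std::size_t snapshot_len) {
        if (key_len != snapshot_len) {
            return false;
        }
        entries_[key_len] = InlineEntry{slot, key_len, snapshot_len};
        return true;
    }

    bool contains(std::size_t key_len) const {
        return entries_.count(key_len) != 0;
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::map<std::size_t, InlineEntry> entries_;
};
```

With a guard of this shape, a lookup hit can never hide residual prefill, because an entry whose lengths disagree never enters the index in the first place.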

The Qwen35 backend change belongs in this PR because, without it, the server-side exact-key guard correctly refuses rounded snapshots and the real replay falls back to cold prefill every turn. The backend now splits only the ubatch that contains the requested inline boundary, saves exactly at that boundary, and then resumes on the same absolute ubatch grid. That keeps the cache invariant exact without shifting the rest of the prefill stream.

Validation

Unit:

cmake --build server/build-pr-cuda126-sm86 --target test_server_unit -j 8
server/build-pr-cuda126-sm86/test_server_unit

Result:

2048 assertions, 0 failures

20-turn real-session replay:

env LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:/usr/lib/wsl/lib \
python3 bench/abc_cache_harness/replay_harness.py \
  --arm AR_35B_KVF_PC64 \
  --trace bench/abc_cache_harness/traces/real_session_long.jsonl \
  --turn-limit 20 --n 1 --max-ctx 90000 \
  --binary .claude/worktrees/prefix-cache-agentic-chain/server/build-pr-cuda126-sm86/dflash_server \
  --port 19172

Result on the final candidate binary 04477072212e30ed...:

total_wall_s              : 63.4
sum_fresh_prefill_tokens  : 64359
mean_cache_hit_ratio      : 0.916
mean_decode_tps           : 92.84
spec_engagement_rate      : 0.0

Invariant log check:

[pc] inline-snap committed slot=0 key_len=31919 snapshot_len=31919
[pc] lookup hit slot=0 key_len=31919 snapshot_len=31919 (of 34113 total)
...
[pc] inline-snap committed slot=19 key_len=64239 snapshot_len=64239

No refusing inline-snap, restore misalignment, mismatch, or error lines were present in the final replay log.
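
A check like this can be scripted against a saved replay log. The sketch below is not part of the PR; count_len_mismatches is a hypothetical helper, and it assumes the log keeps the key_len=N snapshot_len=M layout shown in the excerpt above.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: counts log lines that report both lengths and whose
// key_len differs from snapshot_len. Lines without both fields are ignored.
std::size_t count_len_mismatches(const std::vector<std::string>& lines) {
    static const std::regex re(R"(key_len=(\d+) snapshot_len=(\d+))");
    std::size_t mismatches = 0;
    for (const auto& line : lines) {
        std::smatch m;
        if (std::regex_search(line, m, re) &&
            std::stoull(m[1].str()) != std::stoull(m[2].str())) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A result of zero over the whole log corresponds to the invariant holding for every committed snapshot and every lookup hit.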

Scope

This PR fixes the exact inline prefix-cache chain used by Qwen35 dense/all-hot replay paths. It does not try to recover the separate decode-speed optimizations, pFlash/FlowKV, spec decode, or the constrained MoE hybrid snapshot path; those should stay in separate performance PRs and be validated by the combined replay/deathmatch matrix.

Supersedes #467, which GitHub auto-closed when the head branch was renamed to match current conventions.

cubic-dev-ai

1 issue found across 3 files

cubic-dev-ai

1 issue found across 4 files (changes from recent commits).

cubic-dev-ai

2 issues found across 11 files (changes from recent commits).

dusterbloom marked this pull request as ready for review June 30, 2026 17:53

cubic-dev-ai Bot reviewed Jun 30, 2026

Comment thread server/src/server/prefix_cache.cpp

cubic-dev-ai Bot reviewed Jul 1, 2026

Comment thread server/src/server/prefix_cache.cpp

dusterbloom mentioned this pull request Jul 1, 2026

perf(qwen35): recover dense AR replay fast path #476

Draft

dusterbloom force-pushed the fix/prefix-cache-agentic-chain branch from 6f61460 to 8a64d57 July 1, 2026 14:41

dusterbloom changed the title ~~fix(server): grow prefix-cache snapshot chains~~ fix(server): grow exact prefix-cache snapshot chains Jul 1, 2026

cubic-dev-ai Bot reviewed Jul 1, 2026

Comment thread server/src/internal.h Outdated

Comment thread server/src/qwen35/graph_builders.cpp Outdated

dusterbloom added 6 commits July 3, 2026 11:24

fix(prefix-cache): grow agentic snapshot chains (e364346)
fix(prefix-cache): track physical inline snapshot length (922ca0b)
fix(prefix-cache): evict inline aliases by physical slot (7d99052)
fix(qwen35): materialize exact inline snapshots (1426956)
fix(qwen35): avoid long-context q4 KV crash (5ebaf34)
perf(kvflash): plan pooled suffix prefill spans (cbe5cc6)

dusterbloom force-pushed the fix/prefix-cache-agentic-chain branch from 08c39a0 to cbe5cc6 July 3, 2026 10:09

fix(qwen35): harden kv row base plumbing (a0c1f9f)

dusterbloom wants to merge 7 commits into Luce-Org:main from dusterbloom:fix/prefix-cache-agentic-chain
