SBR M1: opt-in tail-pin SSM-snapshot eviction (8× warm-resume on deep agentic convs) #207
Open
tbraun96 wants to merge 4 commits into
Warm multi-turn hits replay GDN/Mamba SSM state from the nearest deeper Marconi checkpoint to the match point; as the snapshot pool fills, the forecast (last_access x (1+hit_count)) evicts a live conversation's deep intermediate checkpoints (hit_count=0 -- they were never the resume anchor) before transient one-shot traffic, stranding the next resume far from its tail and growing TTFT (measured: ~9.5s, the 1s->21s pathology).

Fix: two-tier eviction in SsmSnapshotIndex::evict_lru -- evict NON-resumable (one-shot / untracked) snapshots before any entry of a RESUMABLE session (one with any hit_count>0). Provably do-no-harm: when every entry is resumable (balanced multi-conversation round-robin) the non-resumable pool is empty and it degrades identically to the baseline forecast. Exact by construction -- only which checkpoint is restored changes; replay is the unchanged bit-exact WY4 path.

Measured (dgx2, Qwen3-Next-80B-NVFP4, deep conv idle under pressure): baseline 9.53s mean / replay ~7600 tok -> SBR 1.18s mean (8.1x), warm cycles 0.45s (21x, replay 11-984 tok), matching llama.cpp continuous-seq resume without keeping the sequence live.

Gated by ATLAS_SBR_TAIL_PIN (default on). Earlier "pin top-K deepest" variants regressed the contended multi-conv regime 5.9s->7.7s by displacing live convs' working sets; the session-level two-tier discriminator avoids that.

Full forensics + benchmark harnesses in research/sbr/ (PHASE0/1 findings, M1_RESULTS, sbr_*.py).
CPU-only synthetic prototype testing whether a 2-D (position x layer) cellular-sheaf L0 harmonic reconciliation recovers globally-consistent multi-layer GDN SSM state from cheap lossy contractive-window reconstructions with higher fidelity than the lossy input. Verdict: the cross-layer sheaf does genuine, non-vacuous work ONLY with an accurate vertical map (oracle: curvature ~0 -> mean 0.9999, beats no-cross-layer baseline by +0.0048, clears the 0.999 gate). With a fittable ridge inter-layer map (cosine 0.978, plaquette curvature 0.48) it is net-negative vs the cheaper horizontal-only (exact GDN recurrence + exact anchors = M1) baseline at practical tau, and never clears the gate. The win that exists is horizontal+anchors (M1), not sheaf topology. Bottleneck is vertical-map quality, not the sheaf math. Honest negative for the practical lever; do not pursue dgx2 validation absent a high-accuracy inter-layer map.
The two-tier variant was empirically worse (strand cyc0 9.04s) and still regressed balanced multi-conv, so revert to the validated top-K=8 policy: pin the top-K deepest snapshots of each resumable session. Split victim selection into evict_lru_inner(pin, k) for unit-testability.

Multi-conversation sweep settled the scope honestly: enabling tail-pin wins the single/few-deep-conversation regime ~8x (9.53s->1.18s, replay 11-984 tok) but regresses balanced many-conversation round-robin ~30% (the recency*hit forecast is already near-optimal there; pinning fights it, and the regime can't be detected from the index's local view). Therefore DEFAULT OFF -- ATLAS_SBR_TAIL_PIN unset is exactly the baseline forecast (provably do-no-harm everywhere); set =1 for deep agentic single-conversation workloads. Exact in both modes.

Also: M3 2-D (position x layer) sheaf reconciliation prototyped -> honest negative (research/sbr/M3_FINDINGS.md): non-vacuous with an oracle inter-layer map but circular/net-negative with a fittable one; M1 is the real win.
The background plan that produced M3_FINDINGS.md (honest negative).
    print(f"[{a.label}] r{rnd} s{s} depth~{depth_tok:6d} TTFT={ttft:.3f}s", flush=True)

    json.dump({"label": a.label, "sessions": a.sessions, "rounds": a.rounds, "rows": rows},
              open(a.out, "w"), indent=2)

    text,toks=gen(a.base_url,a.model,msgs,64)
    res.append({"i":i,"text":text,"toks":toks})
    print(f"[{a.label}] prompt {i}: {text[:60]!r}",flush=True)
    json.dump({"label":a.label,"res":res},open(a.out,"w"),indent=2)

    rows.append({"round":rnd,"conv":s,"depth":depth,"ttft_s":ttft})
    print(f"[{a.label}] r{rnd} conv{s} depth~{depth:6d} TTFT={ttft:.3f}s",flush=True)

    json.dump({"label":a.label,"convs":a.convs,"rounds":a.rounds,"rows":rows},open(a.out,"w"),indent=2)

    json.dump({"label":a.label,"res":res},open(a.out,"w"),indent=2)

    def compare(pa,pb):
        A=json.load(open(pa))["res"]; B=json.load(open(pb))["res"]

    rows.append({"phase":"resume","cyc":cyc,"depth":depth,"ttft_s":ttft})
    print(f"[{a.label}] RESUME cyc{cyc} (after {a.pressure} pressure) depth~{depth:6d} TTFT={ttft:.3f}s",flush=True)

    json.dump({"label":a.label,"rows":rows},open(a.out,"w"),indent=2)

    python3 sbr_parity.py --label tailpin --out p_pin.json --n 10
    python3 sbr_parity.py --compare p_base.json p_pin.json
    """
    import argparse, json, math, time, urllib.request

    slow_heads = []
    # aggregate cosine per tau across heads
    agg = {t: [] for t in taus}
    h_full_norm_slow = []
Problem
On a warm multi-turn hit, KV is reused but the GDN/Mamba SSM layers must replay
their recurrent state from the nearest deeper Marconi checkpoint to the match
point. As the snapshot pool fills, the forecast eviction
(last_access·(1+hit_count)) drops a live conversation's deep per-turn
checkpoints first: their hit_count≈0 (they were never the resume anchor), so
they look like one-shot traffic. The next resume is then left to replay from a
far-shallow survivor. Replay distance grows with depth, so warm-hit TTFT climbs
(measured ~9.5s at ~24k depth; the 1s→21s pathology).
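The failure mode can be reproduced on a toy pool. The following is a minimal sketch, not the runtime's code: the `Snapshot` fields, the capacity, and the sample data are all hypothetical, and it assumes replay restarts from the deepest surviving checkpoint at or below the match point. Only the score `last_access * (1 + hit_count)` is taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    session: str      # hypothetical session label
    depth_tok: int    # token depth at which the SSM state was captured
    last_access: int  # logical clock of the last touch
    hit_count: int    # times this snapshot served as a resume anchor


def forecast_score(s):
    # Score quoted in the PR: last_access * (1 + hit_count); lowest goes first.
    return s.last_access * (1 + s.hit_count)


def evict_baseline(pool, capacity):
    # Keep the `capacity` highest-scoring snapshots, drop the rest.
    return sorted(pool, key=forecast_score, reverse=True)[:capacity]


def replay_distance(pool, session, match_tok):
    # Assumption: replay starts at the deepest checkpoint at or below match_tok,
    # or from token 0 if the session has no surviving checkpoint.
    depths = [s.depth_tok for s in pool
              if s.session == session and s.depth_tok <= match_tok]
    return match_tok - max(depths, default=0)


# One deep conversation: a shallow anchor that was resumed twice, plus five
# deeper per-turn checkpoints that were never a resume anchor (hit_count=0).
pool = [Snapshot("conv", 4000, 6, 2)]
pool += [Snapshot("conv", 8000 + 4000 * i, 2 + i, 0) for i in range(5)]
# Eight more recent one-shot requests arrive while the conversation idles.
pool += [Snapshot(f"oneshot{i}", 512, 7 + i, 0) for i in range(8)]

before = replay_distance(pool, "conv", 24000)
survivors = evict_baseline(pool, capacity=6)
after = replay_distance(survivors, "conv", 24000)
print(f"replay before eviction: {before} tok, after: {after} tok")
```

In this toy run every deep checkpoint scores below the fresher one-shot entries, so only the shallow anchor survives and the resume has to replay most of the conversation.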
Fix (M1) — opt-in, OFF by default
SsmSnapshotIndex::evict_lru gains a tail-pin mode: pin the top-K (=8) deepest
snapshots of each resumable session (a conversation resumed at least once),
evicting only non-pinned entries. A resuming deep conversation then finds a
near-tail SSM anchor instead of a far one.
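A rough Python model of the tail-pin selection is sketched here to make the policy concrete. It is not the Rust implementation: `evict_lru_inner_sketch`, the `Snapshot` fields, and the sample pool are hypothetical, and the sketch simply leaves the pool over capacity if every entry is pinned, a case the description does not cover.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    session: str      # hypothetical session label
    depth_tok: int    # token depth of the captured SSM state
    last_access: int  # logical clock of the last touch
    hit_count: int    # times this snapshot was a resume anchor


def forecast_score(s):
    return s.last_access * (1 + s.hit_count)


def pinned_set(pool, k):
    # Pin the k deepest snapshots of every session resumed at least once.
    by_session = defaultdict(list)
    for s in pool:
        by_session[s.session].append(s)
    pinned = set()
    for snaps in by_session.values():
        if any(s.hit_count > 0 for s in snaps):
            pinned.update(sorted(snaps, key=lambda s: s.depth_tok, reverse=True)[:k])
    return pinned


def evict_lru_inner_sketch(pool, capacity, pin, k=8):
    # pin=False is the plain forecast; pin=True exempts the pinned entries.
    protected = pinned_set(pool, k) if pin else set()
    victims = sorted((s for s in pool if s not in protected), key=forecast_score)
    evicted = set(victims[:max(0, len(pool) - capacity)])
    return [s for s in pool if s not in evicted]


def replay_distance(pool, session, match_tok):
    # Assumption: replay starts at the deepest checkpoint at or below match_tok.
    depths = [s.depth_tok for s in pool
              if s.session == session and s.depth_tok <= match_tok]
    return match_tok - max(depths, default=0)


pool = [Snapshot("conv", 4000, 6, 2)]
pool += [Snapshot("conv", 8000 + 4000 * i, 2 + i, 0) for i in range(5)]
pool += [Snapshot(f"oneshot{i}", 512, 7 + i, 0) for i in range(8)]

off = evict_lru_inner_sketch(pool, capacity=6, pin=False)
on = evict_lru_inner_sketch(pool, capacity=6, pin=True, k=2)
dist_off = replay_distance(off, "conv", 24000)
dist_on = replay_distance(on, "conv", 24000)
print(f"tail-pin off: replay {dist_off} tok; tail-pin on: replay {dist_on} tok")
```

With pinning on, the two deepest checkpoints of the resumed conversation survive and the resume starts at the tail; the cost is that three one-shot entries are evicted earlier than the forecast alone would evict them.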
Measured on dgx2 (Qwen3-Next-80B-NVFP4, deep conv idle under one-shot pressure,
then resumed): mean warm-resume TTFT falls from 9.53 s (baseline) to 1.18 s with
tail-pin (about 8×), and warm cycles resume in 0.45 s, replaying 11-984 tokens.
The 0.45 s warm-cycle latency matches llama.cpp's continuous-sequence resume,
without keeping the sequence live. Exact by construction: only which
checkpoint is restored changes; replay is the unchanged bit-exact WY4 path.
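As a quick sanity check, the headline ratios can be recomputed from the latencies quoted across this PR (9.53 s baseline, 1.18 s mean and 0.45 s warm cycles with tail-pin, and the 5.9 s to 7.7 s round-robin regression). The variable names are illustrative only.

```python
# Latencies quoted in this PR, in seconds.
baseline_s = 9.53       # deep-conversation resume, baseline forecast
tail_pin_mean_s = 1.18  # same scenario with tail-pin, mean
warm_cycle_s = 0.45     # same scenario with tail-pin, warm cycles
rr_baseline_s = 5.9     # balanced round-robin, baseline forecast
rr_tail_pin_s = 7.7     # balanced round-robin, tail-pin enabled

mean_speedup = baseline_s / tail_pin_mean_s
warm_speedup = baseline_s / warm_cycle_s
rr_regression = rr_tail_pin_s / rr_baseline_s - 1.0

print(f"mean speedup: {mean_speedup:.1f}x")
print(f"warm-cycle speedup: {warm_speedup:.0f}x")
print(f"round-robin regression: {rr_regression:.0%}")
```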
Honest scope — why OFF by default
Enabling tail-pin regresses balanced many-conversation round-robin ~30%
(5.9s→7.7s): there the recency·hit forecast is already near-optimal and pinning
fights it, and the regime cannot be reliably detected from the snapshot index's
local view (four gating variants tried, all failed). So it is OFF by default:
with ATLAS_SBR_TAIL_PIN unset, eviction is exactly the baseline forecast
(do-no-harm everywhere). Set ATLAS_SBR_TAIL_PIN=1 (optionally
ATLAS_SBR_TAIL_PIN_K) for deep single/few-conversation agentic workloads, where
it delivers the 8×.
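The opt-in gate can be pictured as below. This is a sketch under stated assumptions, not the runtime's parser: the two variable names and the default K of 8 come from this PR, while `tail_pin_config`, the rule that only the literal value `1` enables pinning, and the fallback to the default on an invalid K are hypothetical choices made for illustration.

```python
import os

DEFAULT_K = 8  # top-K value stated in this PR


def tail_pin_config(env=None):
    """Return (pin_enabled, k) from an environment mapping."""
    env = os.environ if env is None else env
    # Assumption: only the literal "1" turns tail-pin on; unset means baseline.
    pin = env.get("ATLAS_SBR_TAIL_PIN") == "1"
    raw_k = env.get("ATLAS_SBR_TAIL_PIN_K")
    try:
        k = int(raw_k) if raw_k is not None else DEFAULT_K
    except ValueError:
        k = DEFAULT_K  # assumption: fall back rather than fail on bad input
    if k < 1:
        k = DEFAULT_K
    return pin, k


print(tail_pin_config({}))
print(tail_pin_config({"ATLAS_SBR_TAIL_PIN": "1", "ATLAS_SBR_TAIL_PIN_K": "4"}))
```

Keeping the gate a plain opt-in flag matches the scope stated above: the index cannot tell the two regimes apart from its local view, so the operator chooses.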
Research artifacts (research/sbr/)
Full forensics, benchmark harnesses, and findings, including the negative
results that bound the approach:
- … heavy-tailed (median horizon ~163 tok, but ~12% of heads > 8192).
- M3_FINDINGS.md: 2-D (position x layer) sheaf reconciliation prototype, honest
  negative: non-vacuous with an oracle inter-layer map but circular/net-negative
  with a fittable one (an accurate inter-layer restriction map is the layer's
  forward pass → no shortcut). M1 is the real win, not the sheaf framing.
Tests
cargo test -p spark-runtime radix_tree::tests::snapshot: 20 pass, incl.
test_tail_pin_protects_top_k_deepest and test_tail_pin_off_is_pure_forecast.