SBR M1: opt-in tail-pin SSM-snapshot eviction (8× warm-resume on deep agentic convs)#207

Open
tbraun96 wants to merge 4 commits into
mainfrom
feat/sheaf-based-replaying

Conversation

@tbraun96

Contributor

Problem

On a warm multi-turn hit, KV is reused but the GDN/Mamba SSM layers must replay
their recurrent state from the nearest deeper Marconi checkpoint to the match
point. As the snapshot pool fills, the forecast eviction score (last_access·(1+hit_count))
drops a live conversation's deep per-turn checkpoints first: their
hit_count≈0 (they were never the resume anchor), so they look like one-shot
traffic. The next resume must then replay from a much shallower survivor. Replay
distance grows with depth, so warm-hit TTFT climbs (measured ~9.5s at ~24k depth;
the 1s→21s pathology).
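
To make the failure mode concrete, here is a minimal Python sketch of the forecast score described above. It is not the Rust implementation: the `Snapshot` fields, the integer clock, and the helper names are assumptions made for illustration, and only the score formula comes from this description.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Snapshot:
    session: str      # hypothetical session label
    depth: int        # token position of the checkpoint
    last_access: int  # logical clock tick of the last touch (assumed integer clock)
    hit_count: int    # times this snapshot served as a resume anchor


def forecast_score(s: Snapshot) -> int:
    # Score from the description: last_access * (1 + hit_count); lowest goes first.
    return s.last_access * (1 + s.hit_count)


def evict_forecast(pool: list[Snapshot], n: int) -> list[Snapshot]:
    """Drop the n lowest-scoring snapshots and return the survivors."""
    victims = {id(v) for v in sorted(pool, key=forecast_score)[:n]}
    return [s for s in pool if id(s) not in victims]


# An idle deep conversation: one shallow anchor that was resumed once,
# plus deep per-turn checkpoints that were never the resume anchor.
conv = [Snapshot("conv", 4_000, last_access=10, hit_count=1)]
conv += [Snapshot("conv", d, last_access=10, hit_count=0)
         for d in (16_000, 20_000, 24_000)]
# One-shot traffic that arrived afterwards (more recent, never resumed).
one_shot = [Snapshot(f"shot{i}", 500, last_access=11 + i, hit_count=0)
            for i in range(3)]

survivors = evict_forecast(conv + one_shot, 3)
surviving_conv_depths = sorted(s.depth for s in survivors if s.session == "conv")
print(surviving_conv_depths)  # [4000]
```

In this toy pool all three deep checkpoints score below the newer one-shot entries, so only the shallow anchor at depth 4,000 survives and a resume near depth 24,000 would have a long replay.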

Fix (M1) — opt-in, OFF by default

SsmSnapshotIndex::evict_lru gains a tail-pin mode: pin the top-K (=8) deepest
snapshots of each resumable session (a conversation resumed at least once),
evicting only non-pinned entries. A resuming deep conversation then finds a
near-tail SSM anchor instead of a far one.
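
The pinning rule can be sketched in a few lines of Python. This is an illustrative model, not the code in `SsmSnapshotIndex::evict_lru`: the `Snapshot` type, `pinned_set`, and `evict_tail_pin` are hypothetical names, and the behaviour when fewer than `n` unpinned entries exist (here: evict fewer) is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    session: str
    depth: int
    last_access: int
    hit_count: int


def forecast_score(s):
    # Baseline forecast: last_access * (1 + hit_count).
    return s.last_access * (1 + s.hit_count)


def pinned_set(pool, k):
    """Top-k deepest snapshots of every session resumed at least once."""
    by_session = defaultdict(list)
    for s in pool:
        by_session[s.session].append(s)
    pinned = set()
    for snaps in by_session.values():
        if any(s.hit_count > 0 for s in snaps):  # "resumable" session
            pinned.update(sorted(snaps, key=lambda s: s.depth, reverse=True)[:k])
    return pinned


def evict_tail_pin(pool, n, k=8):
    """Evict up to n unpinned snapshots, lowest forecast score first."""
    pinned = pinned_set(pool, k)
    candidates = sorted((s for s in pool if s not in pinned), key=forecast_score)
    victims = set(candidates[:n])
    return [s for s in pool if s not in victims]


pool = [Snapshot("conv", 4_000, 10, 1)]
pool += [Snapshot("conv", d, 10, 0) for d in (16_000, 20_000, 24_000)]
pool += [Snapshot(f"shot{i}", 500, 11 + i, 0) for i in range(3)]

survivors = evict_tail_pin(pool, n=3, k=2)
tail_depths = sorted(s.depth for s in survivors if s.session == "conv")
print(tail_depths)  # [4000, 20000, 24000]
```

With K=2 in this toy pool, the two deepest checkpoints of the resumed conversation are protected, and eviction falls on the unpinned mid-depth checkpoint and the oldest one-shot entries instead.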

Measured on dgx2 (Qwen3-Next-80B-NVFP4, deep conv idle under one-shot pressure,
then resumed):

| | baseline (forecast) | tail-pin ON |
|---|---|---|
| warm-resume TTFT (mean) | 9.53 s | 1.18 s (8×) |
| best cycle | 9.74 s | 0.45 s (21×) |
| replay distance | ~7,600 tok | 11–984 tok |

The 0.45 s warm-cycle latency matches llama.cpp's continuous-sequence resume —
without keeping the sequence live. Exact by construction: only which
checkpoint is restored changes; replay is the unchanged bit-exact WY4 path.
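
As a rough illustration of why the choice of restored checkpoint dominates cost, the sketch below computes a replay distance. It assumes the restored snapshot is the deepest surviving checkpoint at or before the match point and that replay runs forward from there; the function name and the depths are illustrative, not measured values.

```python
import bisect


def replay_distance(checkpoint_depths, match_point):
    """Tokens to replay from the deepest checkpoint at or before match_point."""
    depths = sorted(checkpoint_depths)
    i = bisect.bisect_right(depths, match_point)
    if i == 0:
        return match_point  # no usable checkpoint: replay from the start
    return match_point - depths[i - 1]


# Illustrative depths only (not the measured dgx2 numbers).
far = replay_distance([16_000], match_point=24_000)
near = replay_distance([16_000, 23_000, 23_900], match_point=24_000)
print(far, near)  # 8000 100
```

The replay itself is identical in both cases; only the starting checkpoint, and therefore the number of replayed tokens, differs.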

Honest scope — why OFF by default

Enabling tail-pin regresses balanced many-conversation round-robin ~30%
(5.9s→7.7s): there the recency·hit forecast is already near-optimal and pinning
fights it, and the regime cannot be reliably detected from the snapshot index's
local view (four gating variants tried, all failed). So it is OFF by default:
ATLAS_SBR_TAIL_PIN unset is exactly the baseline forecast (do-no-harm
everywhere). Set ATLAS_SBR_TAIL_PIN=1 (optionally ATLAS_SBR_TAIL_PIN_K) for
deep single/few-conversation agentic workloads, where it delivers the 8×.
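
For readers scripting benchmarks around these switches, the opt-in semantics can be modelled as below. This is a hedged Python sketch, not the runtime's parser: it assumes only the literal value `1` enables the mode and that `ATLAS_SBR_TAIL_PIN_K` is a plain integer defaulting to the K=8 stated above; how the runtime treats other values is not specified here.

```python
import os

DEFAULT_K = 8  # K=8 as stated in the fix description


def tail_pin_config(env=os.environ):
    """Return (enabled, k) from an environment mapping (assumed semantics)."""
    enabled = env.get("ATLAS_SBR_TAIL_PIN") == "1"
    k = int(env.get("ATLAS_SBR_TAIL_PIN_K", DEFAULT_K))
    return enabled, k


print(tail_pin_config({}))  # (False, 8)
print(tail_pin_config({"ATLAS_SBR_TAIL_PIN": "1", "ATLAS_SBR_TAIL_PIN_K": "4"}))  # (True, 4)
```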

Research artifacts (research/sbr/)

Full forensics, benchmark harnesses, and findings, including the negative
results that bound the approach:

  • Dense operator-transport is FLOP-negative for GDN; real GDN decay is
    heavy-tailed (median horizon ~163 tok, but ~12% of heads > 8192).
  • M3 2-D (position × layer) sheaf reconciliation prototyped → honest
    negative: non-vacuous with an oracle inter-layer map but circular/net-negative
    with a fittable one (an accurate inter-layer restriction map is the layer's
    forward pass → no shortcut). M1 is the real win, not the sheaf framing.

Tests

cargo test -p spark-runtime radix_tree::tests::snapshot — 20 pass, incl.
test_tail_pin_protects_top_k_deepest and test_tail_pin_off_is_pure_forecast.

AzeezIsh added 4 commits June 27, 2026 20:38
Warm multi-turn hits replay GDN/Mamba SSM state from the nearest deeper
Marconi checkpoint to the match point; as the snapshot pool fills, the
forecast (last_access x (1+hit_count)) evicts a live conversation's deep
intermediate checkpoints (hit_count=0 -- they were never the resume anchor)
before transient one-shot traffic, stranding the next resume far from its
tail and growing TTFT (measured: ~9.5s, the 1s->21s pathology).

Fix: two-tier eviction in SsmSnapshotIndex::evict_lru -- evict NON-resumable
(one-shot / untracked) snapshots before any entry of a RESUMABLE session
(one with any hit_count>0). Provably do-no-harm: when every entry is
resumable (balanced multi-conversation round-robin) the non-resumable pool is
empty and it degrades identically to the baseline forecast. Exact by
construction -- only which checkpoint is restored changes; replay is the
unchanged bit-exact WY4 path.

Measured (dgx2, Qwen3-Next-80B-NVFP4, deep conv idle under pressure):
baseline 9.53s mean / replay ~7600 tok -> SBR 1.18s mean (8.1x), warm cycles
0.45s (21x, replay 11-984 tok), matching llama.cpp continuous-seq resume
without keeping the sequence live. Gated by ATLAS_SBR_TAIL_PIN (default on).

Earlier "pin top-K deepest" variants regressed the contended multi-conv
regime 5.9s->7.7s by displacing live convs' working sets; the session-level
two-tier discriminator avoids that. Full forensics + benchmark harnesses in
research/sbr/ (PHASE0/1 findings, M1_RESULTS, sbr_*.py).

CPU-only synthetic prototype testing whether a 2-D (position x layer)
cellular-sheaf L0 harmonic reconciliation recovers globally-consistent
multi-layer GDN SSM state from cheap lossy contractive-window
reconstructions with higher fidelity than the lossy input.

Verdict: the cross-layer sheaf does genuine, non-vacuous work ONLY with an
accurate vertical map (oracle: curvature ~0 -> mean 0.9999, beats
no-cross-layer baseline by +0.0048, clears the 0.999 gate). With a fittable
ridge inter-layer map (cosine 0.978, plaquette curvature 0.48) it is
net-negative vs the cheaper horizontal-only (exact GDN recurrence + exact
anchors = M1) baseline at practical tau, and never clears the gate. The win
that exists is horizontal+anchors (M1), not sheaf topology. Bottleneck is
vertical-map quality, not the sheaf math. Honest negative for the practical
lever; do not pursue dgx2 validation absent a high-accuracy inter-layer map.

The two-tier variant was empirically worse (strand cyc0 9.04s) and still
regressed balanced multi-conv, so revert to the validated top-K=8 policy: pin
the top-K deepest snapshots of each resumable session. Split victim selection
into evict_lru_inner(pin, k) for unit-testability.

Multi-conversation sweep settled the scope honestly: enabling tail-pin wins the
single/few-deep-conversation regime ~8x (9.53s->1.18s, replay 11-984 tok) but
regresses balanced many-conversation round-robin ~30% (the recency*hit forecast
is already near-optimal there; pinning fights it, and the regime can't be
detected from the index's local view). Therefore DEFAULT OFF -- ATLAS_SBR_TAIL_PIN
unset is exactly the baseline forecast (provably do-no-harm everywhere); set =1
for deep agentic single-conversation workloads. Exact in both modes.

Also: M3 2-D (position x layer) sheaf reconciliation prototyped -> honest
negative (research/sbr/M3_FINDINGS.md): non-vacuous with an oracle inter-layer
map but circular/net-negative with a fittable one; M1 is the real win.

The background plan that produced M3_FINDINGS.md (honest negative).
@tbraun96 tbraun96 requested a review from AzeezIsh as a code owner June 28, 2026 01:38
print(f"[{a.label}] r{rnd} s{s} depth~{depth_tok:6d} TTFT={ttft:.3f}s", flush=True)

json.dump({"label": a.label, "sessions": a.sessions, "rounds": a.rounds, "rows": rows},
open(a.out, "w"), indent=2)
text,toks=gen(a.base_url,a.model,msgs,64)
res.append({"i":i,"text":text,"toks":toks})
print(f"[{a.label}] prompt {i}: {text[:60]!r}",flush=True)
json.dump({"label":a.label,"res":res},open(a.out,"w"),indent=2)
rows.append({"round":rnd,"conv":s,"depth":depth,"ttft_s":ttft})
print(f"[{a.label}] r{rnd} conv{s} depth~{depth:6d} TTFT={ttft:.3f}s",flush=True)

json.dump({"label":a.label,"convs":a.convs,"rounds":a.rounds,"rows":rows},open(a.out,"w"),indent=2)
json.dump({"label":a.label,"res":res},open(a.out,"w"),indent=2)

def compare(pa,pb):
A=json.load(open(pa))["res"]; B=json.load(open(pb))["res"]
rows.append({"phase":"resume","cyc":cyc,"depth":depth,"ttft_s":ttft})
print(f"[{a.label}] RESUME cyc{cyc} (after {a.pressure} pressure) depth~{depth:6d} TTFT={ttft:.3f}s",flush=True)

json.dump({"label":a.label,"rows":rows},open(a.out,"w"),indent=2)
python3 sbr_parity.py --label tailpin --out p_pin.json --n 10
python3 sbr_parity.py --compare p_base.json p_pin.json
"""
import argparse, json, math, time, urllib.request
slow_heads = []
# aggregate cosine per tau across heads
agg = {t: [] for t in taus}
h_full_norm_slow = []
2 participants