perf(qwen35): opt-in skip of graph FWHT K-rotation for q4_0 (DFLASH_SKIP_WHT) by dusterbloom · Pull Request #471 · Luce-Org/lucebox-hub

dusterbloom · 2026-06-30T20:42:22Z

Summary

Opt-in skip of graph-level FWHT K-rotation for a q4_0 K cache (DFLASH_SKIP_WHT=1). Default keeps the rotation (unchanged behavior), so warm-restore snapshots written with rotated K stay valid across upgrade. When enabled it drops two turbo-wht kernels per prefill chunk to match llama.cpp's leaner q4_0 path.
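For readers unfamiliar with the transform being skipped, here is a minimal Python sketch of an orthonormal fast Walsh-Hadamard transform (FWHT). It is illustrative only: the function names `fwht` and `dot`, the block length of 8, and the sample vectors are assumptions, and it does not reproduce the turbo-wht kernels or the engine's actual rotation layout. It shows two properties of the transform itself: applying it twice returns the input, and applying it to two vectors leaves their dot product unchanged.

```python
import math

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly pass: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    # Scaling by 1/sqrt(n) makes the transform orthonormal (and its own inverse).
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# Hypothetical sample vectors standing in for one query row and one key row.
q = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, 3.0]
k = [1.0, 0.5, -0.5, 2.0, 0.25, -1.25, 0.75, -2.0]

roundtrip = fwht(fwht(k))            # applying the rotation twice undoes it
rotated_dot = dot(fwht(q), fwht(k))  # dot product in the rotated basis
plain_dot = dot(q, k)                # dot product in the original basis
```

The dot-product identity holds only when both operands are rotated the same way; whether and where the engine rotates the query side is not stated in this PR. A cache quantized in the rotated basis holds different values from one quantized in the plain basis, which is consistent with the snapshot-compatibility concern raised in the summary.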

Scope / evidence

Measured on RTX 3090, same binary env-toggled, both run-orders, q4_0/q4_0 KV: opt-in skip is a small (~1%, near the n=6 noise floor) prefill win, decode-neutral, needle recall byte-identical both arms. q4_0 K + q8_0 V is untested (needs FA_ALL_QUANTS=ON for the mixed FA kernel). Ships as an opt-in knob, not a default.

History

Reduced from the original prefill-logits-policy change (measured ~2.7% slower on a 3090, byte-identical, plus a latent argmax offset bug on multi-token final chunks) down to just the FWHT lever. test_server_unit: 2042 assertions, 0 failures.

cubic-dev-ai

No issues found across 5 files


davide221 · 2026-07-02T13:34:34Z

Validated on lucebox2 (RTX 3090, stability profile, Qwen3.6-27B-Q4_K_M + dflash-draft-3.6-q8_0, greedy):

- AR decode outputs byte-identical to main (3/3 hashes incl. a ~5k-token chunked prefill)
- Spec-decode outputs byte-identical AND accept stats identical (326/1440, avg_commit 4.27 / 103/512, 4.00) -> DFlash feature capture provably unaffected
- Perf deltas within run-order noise at the locked profile; the removed per-chunk lm_head work is a strict win on longer prefills

LGTM to merge.

davide221 · 2026-07-02T18:32:43Z

Follow-up on the earlier LGTM: correctness stands (byte-identical everywhere, spec accept stats exactly equal), but I've now measured the multi-chunk prefill perf claim properly on lucebox2 (RTX 3090, 7436-token prompt = 4 chunks @2048, 3 cache-defeating warm repeats, both run orders to kill the thermal confound):

| Ordering | main warm mean | this PR warm mean | delta |
| --- | --- | --- | --- |
| main first, PR second | 11304 ms | 11785 ms | +4.3% slower |
| PR first, main second | 12262 ms | 12422 ms | +1.3% slower |

The PR measures slower in both orderings (~+2.7% midpoint), output hashes identical. Likely mechanism: the engine already computes last-row-only logits per chunk, so there is almost nothing for final-logits-only to save, while the per-chunk graph-shape change adds a little overhead.
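The arithmetic behind these figures can be checked with a short Python sketch. The timings are the warm means reported above; the variable and function names are hypothetical. How the "midpoint" was derived is not stated, so pooling the means across both run orders is an assumption, one that happens to reproduce the quoted ~2.7%.

```python
import math

# Warm means in milliseconds, copied from the measurements above.
runs = {
    "main first, PR second": {"main_ms": 11304, "pr_ms": 11785},
    "PR first, main second": {"main_ms": 12262, "pr_ms": 12422},
}

def pct_slower(main_ms, pr_ms):
    """Relative slowdown of the PR versus main, in percent."""
    return (pr_ms - main_ms) / main_ms * 100.0

# Per-ordering deltas.
per_order = {name: pct_slower(r["main_ms"], r["pr_ms"]) for name, r in runs.items()}

# Assumption: pool each arm's mean across both run orders, then compare.
main_mean = sum(r["main_ms"] for r in runs.values()) / len(runs)
pr_mean = sum(r["pr_ms"] for r in runs.values()) / len(runs)
pooled = pct_slower(main_mean, pr_mean)

# A 7436-token prompt split into 2048-token chunks.
chunks = math.ceil(7436 / 2048)

for name, delta in per_order.items():
    print(f"{name}: +{delta:.1f}%")
print(f"pooled: +{pooled:.1f}%, chunks: {chunks}")
```

Running both orders matters because whichever arm runs second tends to see a warmer GPU; the gap between the two per-ordering deltas gives a rough sense of that confound.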

Not a correctness objection — but as a perf PR it doesn't deliver on a 3090. @dusterbloom which hardware/config shows the win on your side? If it's HIP/large-vocab-specific it may be worth gating rather than defaulting.

dusterbloom · 2026-07-03T09:27:11Z

Good catch — you're right, the logits policy doesn't earn its place. Re-measured both run-orders on a 3090 and it's ~2.7% slower too, byte-identical; it also had a latent argmax bug (offset hardcoded to 0 → reads the first token's argmax instead of the last on multi-token final chunks). Dropped it.
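The argmax offset bug is easy to picture with a toy example. The Python sketch below is not the engine's code: the function names, the list-of-rows logits layout, and the sample values are all hypothetical. It only illustrates how a row offset fixed at 0 selects the first token's argmax when the final chunk holds more than one token.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def next_token_buggy(chunk_logits):
    # Offset hardcoded to 0: always reads the first row of the chunk.
    offset = 0
    return argmax(chunk_logits[offset])

def next_token_fixed(chunk_logits):
    # The next token must come from the last row of the final chunk.
    offset = len(chunk_logits) - 1
    return argmax(chunk_logits[offset])

# Hypothetical logits for a 3-token final chunk over a 4-entry vocabulary.
final_chunk = [
    [0.1, 2.0, 0.3, 0.0],  # token 0: argmax is 1
    [0.0, 0.1, 0.2, 3.0],  # token 1: argmax is 3
    [1.5, 0.2, 0.1, 0.0],  # token 2 (last): argmax is 0
]

buggy = next_token_buggy(final_chunk)
fixed = next_token_fixed(final_chunk)
```

With a single-token final chunk both functions agree, which fits the description of the bug as latent: it only shows up on multi-token final chunks.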

The part worth keeping was riding under the wrong title: the graph-level FWHT K-rotation skip for q4_0. I've reduced #471 to just that, and made it opt-in (DFLASH_SKIP_WHT=1) rather than a default flip. Default keeps the rotation so warm-restore snapshots written with rotated K stay valid across upgrade — the disk prefix cache is keyed on prompt hash only (no K-basis fingerprint), so a silent default flip would basis-mismatch old snapshots.
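The snapshot-compatibility argument can be made concrete with a small Python sketch. Everything here is hypothetical (function names, key format, the `k-basis` tag) and none of it is taken from the engine's cache code or proposed by this PR. It shows why a key built from the prompt alone cannot tell a rotated-K snapshot from an unrotated one, and what a basis-aware key would look like by contrast.

```python
import hashlib

def prompt_only_key(prompt_tokens):
    """Cache key derived from the prompt alone, as the comment above describes."""
    data = ",".join(str(t) for t in prompt_tokens).encode()
    return hashlib.sha256(data).hexdigest()

def basis_aware_key(prompt_tokens, k_rotated):
    """Hypothetical variant that also fingerprints the K basis."""
    tag = "k-basis=fwht" if k_rotated else "k-basis=plain"
    data = (tag + "|" + ",".join(str(t) for t in prompt_tokens)).encode()
    return hashlib.sha256(data).hexdigest()

prompt = [101, 2023, 2003, 1037, 3231, 102]  # made-up token ids

# A snapshot written with rotated K and a reader that skips the rotation
# would look up the same entry under a prompt-only key.
same_entry = prompt_only_key(prompt) == prompt_only_key(prompt)

# With a basis fingerprint the two configurations map to different entries.
distinct_entries = basis_aware_key(prompt, True) != basis_aware_key(prompt, False)
```

Keeping the rotation as the default sidesteps the collision without changing the key format, at the cost of leaving the skip as an explicit opt-in.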

Measured here (same binary, env-toggled, both orderings, q4_0/q4_0 KV): skip is a small (~1%, near the n=6 noise floor) prefill win and decode-neutral, needle recall byte-identical both arms. Honest caveat: q4_0/q4_0 only — the intended q4_0 K + q8_0 V needs FA_ALL_QUANTS=ON to compile the mixed FA kernel, untested here. So it ships as an opt-in knob, not a default.

…KIP_WHT) Default keeps the rotation (unchanged behavior); DFLASH_SKIP_WHT=1 drops two turbo-wht kernels per prefill chunk on a q4_0 K cache to match llama.cpp's q4_0 path. Removes the earlier prefill-logits policy (measured slower on 3090, byte-identical) which also had a latent argmax offset bug on multi-token final chunks.

dusterbloom marked this pull request as ready for review June 30, 2026 20:46

cubic-dev-ai Bot reviewed Jun 30, 2026


dusterbloom force-pushed the codex/qwen35-prefill-policy branch from d68b7cc to 42763d5 Compare July 3, 2026 09:26

dusterbloom changed the title ~~perf(qwen35): project only final prefill logits~~ perf(qwen35): opt-in skip of graph FWHT K-rotation for q4_0 (DFLASH_SKIP_WHT) Jul 3, 2026

dusterbloom force-pushed the codex/qwen35-prefill-policy branch from 42763d5 to 06a6818 Compare July 3, 2026 10:10

dusterbloom wants to merge 1 commit into Luce-Org:main from dusterbloom:codex/qwen35-prefill-policy
