perf(qwen35): opt-in skip of graph FWHT K-rotation for q4_0 (DFLASH_SKIP_WHT) #471
dusterbloom wants to merge 1 commit into
Validated on lucebox2 (RTX 3090, stability profile, Qwen3.6-27B-Q4_K_M + dflash-draft-3.6-q8_0, greedy):
Follow-up on the earlier LGTM: correctness stands (byte-identical everywhere, spec accept stats exactly equal), but I've now measured the multi-chunk prefill perf claim properly on lucebox2 (RTX 3090, 7436-token prompt = 4 chunks @2048, 3 cache-defeating warm repeats, both run orders to kill the thermal confound):
The PR measures slower in both orderings (about +2.7% at the midpoint), with identical output hashes. Likely mechanism: the engine already computes last-row-only logits per chunk, so there is almost nothing for final-logits-only to save, while the per-chunk graph-shape change adds a little overhead. This is not a correctness objection, but as a perf PR it doesn't deliver on a 3090. @dusterbloom which hardware/config shows the win on your side? If it's HIP/large-vocab-specific it may be worth gating rather than defaulting.
d68b7cc to 42763d5
Good catch, you're right: the logits policy doesn't earn its place. Re-measured both run-orders on a 3090 and it's ~2.7% slower here too, byte-identical. It also had a latent argmax bug: the offset was hardcoded to 0, so on multi-token final chunks it read the first token's argmax instead of the last. Dropped it.
The part worth keeping was riding under the wrong title: the graph-level FWHT K-rotation skip for q4_0. I've reduced #471 to just that, and made it opt-in (…).
Measured here (same binary, env-toggled, both orderings, q4_0/q4_0 KV): skip is a small (~1%, near the n=6 noise floor) prefill win and decode-neutral, with needle recall byte-identical in both arms.
Honest caveat: q4_0/q4_0 only. The intended q4_0 K + q8_0 V needs …
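The argmax offset bug described above can be illustrated with a minimal Python sketch. This is not the engine's code: the function names and the toy logits are hypothetical, and the real implementation works on GPU tensors rather than Python lists. It only shows why a hardcoded offset of 0 goes unnoticed on single-token chunks and misreads multi-token ones.

```python
def argmax(row):
    """Return the index of the largest value in a list of floats."""
    return max(range(len(row)), key=row.__getitem__)


def next_token_buggy(chunk_logits):
    # Offset hardcoded to 0: always reads the FIRST token's logits row.
    return argmax(chunk_logits[0])


def next_token_fixed(chunk_logits):
    # The next token is predicted from the LAST position in the chunk.
    return argmax(chunk_logits[-1])


# Hypothetical final chunk: 3 tokens over a 4-entry vocabulary.
multi_token_chunk = [
    [0.1, 0.9, 0.0, 0.0],  # row for first token, argmax = 1
    [0.2, 0.1, 0.6, 0.1],  # row for middle token, argmax = 2
    [0.0, 0.1, 0.2, 0.7],  # row for last token, argmax = 3
]

# Hypothetical single-token final chunk: first row and last row coincide.
single_token_chunk = [[0.3, 0.1, 0.5, 0.1]]

if __name__ == "__main__":
    print(next_token_buggy(multi_token_chunk), next_token_fixed(multi_token_chunk))
    print(next_token_buggy(single_token_chunk), next_token_fixed(single_token_chunk))
```

With a one-token final chunk both functions agree, which is why a bug of this shape can stay latent until a prompt ends on a multi-token chunk.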
…KIP_WHT) Default keeps the rotation (unchanged behavior); DFLASH_SKIP_WHT=1 drops two turbo-wht kernels per prefill chunk on a q4_0 K cache to match llama.cpp's q4_0 path. Removes the earlier prefill-logits policy (measured slower on 3090, byte-identical) which also had a latent argmax offset bug on multi-token final chunks.
42763d5 to 06a6818
Summary
Opt-in skip of graph-level FWHT K-rotation for a q4_0 K cache (DFLASH_SKIP_WHT=1). Default keeps the rotation (unchanged behavior), so warm-restore snapshots written with rotated K stay valid across upgrade. When enabled it drops two turbo-wht kernels per prefill chunk to match llama.cpp's leaner q4_0 path.
Scope / evidence
Measured on RTX 3090, same binary env-toggled, both run-orders, q4_0/q4_0 KV: opt-in skip is a small (~1%, near the n=6 noise floor) prefill win, decode-neutral, needle recall byte-identical both arms. q4_0 K + q8_0 V is untested (needs FA_ALL_QUANTS=ON for the mixed FA kernel). Ships as an opt-in knob, not a default.
History
Reduced from the original prefill-logits-policy change (measured ~2.7% slower on a 3090, byte-identical, plus a latent argmax offset bug on multi-token final chunks) down to just the FWHT lever.
test_server_unit: 2042 assertions, 0 failures.
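For readers unfamiliar with the transform being skipped, here is a minimal Python sketch of an orthonormal fast Walsh-Hadamard transform (FWHT). It is not the turbo-wht kernel: the function names and test vectors are illustrative, and it assumes the rotation is applied to both Q and K over a power-of-two block, which the PR text does not spell out. Under that assumption the transform is orthogonal, so attention dot products are unchanged in exact arithmetic. That is consistent with the skip being output-neutral here, though quantization error can still differ between the two paths.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Illustrative query/key vectors (not taken from the model).
q = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0]
k = [0.5, 1.0, -1.5, 2.0, 4.0, -0.5, 0.25, 1.0]

if __name__ == "__main__":
    print(dot(q, k), dot(fwht(q), fwht(k)))
```

The same property explains the snapshot concern in the summary: a K cache written in the rotated basis only produces the right scores against a Q rotated the same way, so flipping the default would invalidate snapshots written before the change.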