perf(qwen35): opt-in skip of graph FWHT K-rotation for q4_0 (DFLASH_SKIP_WHT) #471

Open
dusterbloom wants to merge 1 commit into Luce-Org:main from dusterbloom:codex/qwen35-prefill-policy

@dusterbloom dusterbloom commented Jun 30, 2026

Collaborator

Summary

Opt-in skip of graph-level FWHT K-rotation for a q4_0 K cache (DFLASH_SKIP_WHT=1). Default keeps the rotation (unchanged behavior), so warm-restore snapshots written with rotated K stay valid across upgrade. When enabled it drops two turbo-wht kernels per prefill chunk to match llama.cpp's leaner q4_0 path.
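
A minimal sketch of the opt-in gate described above, written in Python for illustration only. The helper name `skip_graph_wht`, the string cache-type tag, and the rule that only the exact value `1` enables the skip are assumptions, not the engine's actual parsing logic.

```python
import os


def skip_graph_wht(k_cache_type, env=None):
    """Return True only when the opt-in skip applies (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: only the exact value "1" turns the skip on.
    opted_in = env.get("DFLASH_SKIP_WHT") == "1"
    # The skip targets a q4_0 K cache; every other case keeps the rotation.
    return opted_in and k_cache_type == "q4_0"
```

With the variable unset the function returns False, which mirrors the "default keeps the rotation" behavior that keeps older snapshots valid.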

Scope / evidence

Measured on an RTX 3090 with the same binary toggled via the environment variable, in both run orders, with q4_0/q4_0 KV: the opt-in skip is a small prefill win (~1%, near the n=6 noise floor), decode-neutral, with needle recall byte-identical in both arms. q4_0 K + q8_0 V is untested (it needs FA_ALL_QUANTS=ON for the mixed FA kernel). Ships as an opt-in knob, not a default.

History

Reduced from the original prefill-logits-policy change (measured ~2.7% slower on a 3090, byte-identical, plus a latent argmax offset bug on multi-token final chunks) down to just the FWHT lever. test_server_unit: 2042 assertions, 0 failures.

@dusterbloom dusterbloom marked this pull request as ready for review June 30, 2026 20:46

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

No issues found across 5 files

@davide221

Contributor

Validated on lucebox2 (RTX 3090, stability profile, Qwen3.6-27B-Q4_K_M + dflash-draft-3.6-q8_0, greedy):

  • AR decode outputs byte-identical to main (3/3 hashes incl. a ~5k-token chunked prefill)
  • Spec-decode outputs byte-identical AND accept stats identical (326/1440, avg_commit 4.27 / 103/512, 4.00) -> DFlash feature capture provably unaffected
  • Perf deltas within run-order noise at the locked profile; the removed per-chunk lm_head work is a strict win on longer prefills
  • LGTM to merge.

@davide221

Contributor

Follow-up on the earlier LGTM: correctness stands (byte-identical everywhere, spec accept stats exactly equal), but I've now measured the multi-chunk prefill perf claim properly on lucebox2 (RTX 3090, 7436-token prompt = 4 chunks @2048, 3 cache-defeating warm repeats, both run orders to kill the thermal confound):

| Ordering | main warm mean | this PR warm mean | delta |
|---|---|---|---|
| main first, PR second | 11304 ms | 11785 ms | +4.3% slower |
| PR first, main second | 12262 ms | 12422 ms | +1.3% slower |

The PR measures slower in both orderings (~+2.7% midpoint), output hashes identical. Likely mechanism: the engine already computes last-row-only logits per chunk, so there is almost nothing for final-logits-only to save, while the per-chunk graph-shape change adds a little overhead.
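
For anyone re-deriving the numbers, the arithmetic behind the table can be sketched as below. The timings, prompt length, and chunk size are the ones reported in this comment; the helper name `pct_slower` and the two ways of combining the run orders are illustrative assumptions about how the midpoint was formed.

```python
import math


def pct_slower(base_ms, new_ms):
    """Relative slowdown of new_ms versus base_ms, in percent."""
    return (new_ms - base_ms) / base_ms * 100.0


# (main warm mean, PR warm mean) in milliseconds, per run order.
runs = {
    "main first, PR second": (11304, 11785),
    "PR first, main second": (12262, 12422),
}

deltas = {name: pct_slower(main, pr) for name, (main, pr) in runs.items()}

# Two plausible ways to summarise both orderings in one figure.
mean_of_deltas = sum(deltas.values()) / len(deltas)
pooled = pct_slower(
    sum(main for main, _ in runs.values()),
    sum(pr for _, pr in runs.values()),
)

# 7436-token prompt split into 2048-token prefill chunks.
chunks = math.ceil(7436 / 2048)
```

Both summaries land within about a tenth of a percentage point of each other, close to the quoted ~2.7%, so the conclusion does not depend on which one is used.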

Not a correctness objection — but as a perf PR it doesn't deliver on a 3090. @dusterbloom which hardware/config shows the win on your side? If it's HIP/large-vocab-specific it may be worth gating rather than defaulting.

@dusterbloom dusterbloom force-pushed the codex/qwen35-prefill-policy branch from d68b7cc to 42763d5 July 3, 2026 09:26
@dusterbloom dusterbloom changed the title from "perf(qwen35): project only final prefill logits" to "perf(qwen35): opt-in skip of graph FWHT K-rotation for q4_0 (DFLASH_SKIP_WHT)" Jul 3, 2026
@dusterbloom

Collaborator Author

Good catch — you're right, the logits policy doesn't earn its place. Re-measured both run-orders on a 3090 and it's ~2.7% slower too, byte-identical; it also had a latent argmax bug (offset hardcoded to 0 → reads the first token's argmax instead of the last on multi-token final chunks). Dropped it.
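
The argmax offset bug can be illustrated with a small sketch. It assumes logits for a chunk are stored as one flat row-major buffer of `n_tokens * vocab` values; that layout and all function names here are hypothetical and do not come from the dropped code.

```python
def argmax_row(flat_logits, vocab, row):
    """Index of the largest logit within one row of a flat row-major buffer."""
    start = row * vocab
    row_vals = flat_logits[start:start + vocab]
    return max(range(vocab), key=row_vals.__getitem__)


def next_token_buggy(flat_logits, vocab, n_tokens):
    # Offset hardcoded to 0: always reads the FIRST token's row.
    return argmax_row(flat_logits, vocab, 0)


def next_token_fixed(flat_logits, vocab, n_tokens):
    # The next-token prediction lives in the LAST token's row.
    return argmax_row(flat_logits, vocab, n_tokens - 1)


# Three-token final chunk, vocabulary of four.
logits = [
    0.1, 0.9, 0.0, 0.0,  # row 0, argmax 1
    0.3, 0.1, 0.2, 0.0,  # row 1, argmax 0
    0.0, 0.0, 0.2, 0.7,  # row 2, argmax 3
]
```

On a single-token final chunk both versions agree, which is why the bug stays latent until the final chunk carries more than one token.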

The part worth keeping was riding under the wrong title: the graph-level FWHT K-rotation skip for q4_0. I've reduced #471 to just that, and made it opt-in (DFLASH_SKIP_WHT=1) rather than a default flip. Default keeps the rotation so warm-restore snapshots written with rotated K stay valid across upgrade — the disk prefix cache is keyed on prompt hash only (no K-basis fingerprint), so a silent default flip would basis-mismatch old snapshots.
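
To make the basis-mismatch concern concrete, here is a toy sketch in pure Python. It assumes the rotation is an orthonormal Walsh-Hadamard transform and that attention scores stay correct only when Q and K are in the same basis; the vectors and helper names are made up for illustration and say nothing about the engine's kernels.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, -2.0, 0.5, 3.0]
k = [0.25, 1.0, -1.5, 2.0]

plain = dot(q, k)                   # no rotation on either side
same_basis = dot(fwht(q), fwht(k))  # both rotated: score is preserved
mixed_basis = dot(q, fwht(k))       # unrotated Q against a rotated K from an old snapshot
```

Because the transform is orthonormal, `same_basis` matches `plain` up to rounding, while `mixed_basis` does not. A snapshot written with rotated K and read by a path that no longer rotates would fall into the mixed case.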

Measured here (same binary, env-toggled, both orderings, q4_0/q4_0 KV): skip is a small (~1%, near the n=6 noise floor) prefill win and decode-neutral, needle recall byte-identical both arms. Honest caveat: q4_0/q4_0 only — the intended q4_0 K + q8_0 V needs FA_ALL_QUANTS=ON to compile the mixed FA kernel, untested here. So it ships as an opt-in knob, not a default.

…KIP_WHT)

Default keeps the rotation (unchanged behavior); DFLASH_SKIP_WHT=1 drops two
turbo-wht kernels per prefill chunk on a q4_0 K cache to match llama.cpp's
q4_0 path. Removes the earlier prefill-logits policy (measured slower on 3090,
byte-identical) which also had a latent argmax offset bug on multi-token final
chunks.
@dusterbloom dusterbloom force-pushed the codex/qwen35-prefill-policy branch from 42763d5 to 06a6818 July 3, 2026 10:10