perf(qwen35): recover dense AR replay fast path by dusterbloom · Pull Request #476 · Luce-Org/lucebox-hub

dusterbloom · 2026-07-01T08:31:02Z

Summary

Extract the server-only dense AR replay micro-optimizations from the replay-win branch into a focused PR that does not depend on ggml PR24 or any submodule bump.

This keeps the in-place GDN/submodule work isolated in #473, while letting reviewers evaluate the independent Qwen35 AR decode fast path now.

What changed

- Fuse DeltaNet gated output with ggml_swiglu_split.
- Skip graph-level FWHT K rotation for q4_0 by default, while preserving explicit overrides:
  - DFLASH_FORCE_WHT=1 restores the old graph-level rotation.
  - DFLASH_NO_WHT disables it.
- Skip Q/K repeat-back in pure AR where the shape is already single-token decode.
- Use contiguous concat inputs / materialized qkv_T to reduce concat-cont overhead.
- Reuse the Qwen35 AR decode graph inside the same 256-token FA bucket so CUDA graph warmup/replay is not reset every token.
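The FWHT override behaviour above can be sketched as a small decision function. This is an illustration only, not the server code: the function name, the precedence of DFLASH_FORCE_WHT over DFLASH_NO_WHT, the treatment of any non-empty DFLASH_NO_WHT value as "set", and the default for KV types other than q4_0 and TQ3_0 are all assumptions.

```python
# Hypothetical sketch of the FWHT gate; names and precedence are assumptions.
# KV types for which the PR text says graph-level rotation is off by default.
SKIP_BY_DEFAULT = {"q4_0", "tq3_0"}

def use_graph_level_fwht(kv_type: str, env: dict) -> bool:
    """Return True if the graph-level FWHT K rotation should be built."""
    # Assumption: the force flag wins over everything else.
    if env.get("DFLASH_FORCE_WHT") == "1":
        return True
    # Assumption: any non-empty value counts as "disable".
    if env.get("DFLASH_NO_WHT"):
        return False
    # Assumption: other KV types keep the rotation by default.
    return kv_type.lower() not in SKIP_BY_DEFAULT

default_q4 = use_graph_level_fwht("q4_0", {})
forced_q4 = use_graph_level_fwht("q4_0", {"DFLASH_FORCE_WHT": "1"})
```

With these assumptions, q4_0 skips the rotation unless DFLASH_FORCE_WHT=1 is set, which matches the behaviour described in the list above.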

Scope and dependencies

- Server-only: Qwen35 backend/graph code and one small graph-options helper header.
- No server/deps/llama.cpp submodule change.
- No ggml_cuda_set_skip_props_check / props-check hook.
- No in-place GDN API use; that remains in #473 (perf(qwen35): use in-place GDN state write in pure AR) and depends on ggml PR24.
- Does not include the KVFlash library/integration work from:
  - #466 (feat(kvflash): pager serialize/deserialize + QK residency library)
  - #468 (fix(server): grow exact prefix-cache snapshot chains)
  - #469 (fix(qwen35): route non-tree GDN capture through persist buffer (src[7]))
  - #470 (fix(server): recover single-tool JSON argument responses)

Why

Our best 6-turn dense replay result recovered a decode-rate win over llama.cpp, but #473 mixed those micro-opts with the ggml-dependent in-place GDN consumer. Splitting this lets the low-risk server-side perf work land independently and keeps the reviewer surface small.

Latest measured replay context from the winning stack:

llama.cpp slot-cache: prefill about 6.900s, decode 5.394s, wall 12.294s, weighted decode 42.642 tok/s.
Luce patched/final stack: prefill about 8.1-8.7s, decode about 5.28-5.40s, wall about 13.6-14.1s, weighted decode about 43.6 tok/s.

This PR carries the server-only part of that decode recovery. Prefill parity remains separate work.

Validation

Local focused worktree, upstream ggml submodule a317b0ea0fc9eb716a311976fed8dc0f301dc09f:

env CC=/usr/bin/gcc-11 CXX=/usr/bin/g++-11 CUDACXX=/usr/local/cuda-12.6/bin/nvcc LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:/usr/lib/wsl/lib cmake -S server -B server/build-ar-replay-microopts-gcc11 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11 -DGGML_CCACHE=OFF
env LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:/usr/lib/wsl/lib cmake --build server/build-ar-replay-microopts-gcc11 --target dflash_server test_server_unit -j 8
env LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:/usr/lib/wsl/lib server/build-ar-replay-microopts-gcc11/test_server_unit
Result: 2022 assertions, 0 failures

Notes:

A first configure with /usr/bin/nvcc 12.0 + GCC 13 failed at CUDA compiler identification due to the known glibc _Float* incompatibility; CUDA 12.6 + GCC 11 configured and built successfully.
The build disables optional BSA because that optional submodule was not initialized in this focused worktree.

In build_delta_net_block, replace the two-op silu(z_4d) + mul(output, z_silu) pattern with ggml_swiglu_split(z_4d, output_n). swiglu_split(a,b)=silu(a)*b, mathematically identical. Removes 30 kernel launches per AR decode step (UNARY-30, MUL-30, GLU+30; graph 2744->2714). Token-identical at temp=0 confirmed. Part 1 of launch-count gap closure (30/162 = 18.5% of gap). (cherry picked from commit cfc9ff7)
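The equivalence claimed above, swiglu_split(a,b)=silu(a)*b, can be checked with a minimal scalar sketch. This is plain Python rather than ggml; the helper names are hypothetical and only model the two-pass versus one-pass structure, plus the node-count arithmetic quoted in the commit message.

```python
import math

def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def gated_output_two_ops(z, out):
    """Old pattern: one pass for silu(z), a second pass for the multiply."""
    z_silu = [silu(v) for v in z]                  # stands in for the UNARY node
    return [s * o for s, o in zip(z_silu, out)]    # stands in for the MUL node

def gated_output_fused(z, out):
    """Fused pattern: silu(z) * out in a single pass (one GLU node)."""
    return [silu(v) * o for v, o in zip(z, out)]

z = [-2.0, -0.5, 0.0, 0.5, 2.0]
out = [1.5, -1.0, 3.0, 0.25, -4.0]
two_ops = gated_output_two_ops(z, out)
fused = gated_output_fused(z, out)

# Node accounting from the commit message: UNARY-30, MUL-30, GLU+30.
nodes_after = 2744 - 30 - 30 + 30
```

Both paths compute the same products in the same order, so the results agree element by element, and the net node change is 30 fewer nodes per step.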

…re AR

FWHT gate: DFLASH_NO_WHT=1 disables the K/Q rotation that costs ~16 extra kernels/token (~3% decode). q4_0 KV doesn't need the outlier spreading on Ampere. Default off for TQ3_0 (WHT baked into quant).

Repeat_back skip: in pure AR (no tree/capture), the SSM kernel's broadcast handles the head mismatch natively; skip the repeat_back copy per DeltaNet layer per step (~1% decode). (cherry picked from commit 309b1ce)

The DeltaNet conv_input concat (conv_states_r, qkv_T) hit the slow concat_f32_dim0 kernel (15us) because both inputs were non-contiguous (reshape + transpose). Wrapping in ggml_cont routes to the fast concat_cont path (3.3us) — ~0.5ms/token saved in the 30-layer serial DeltaNet recurrence. (cherry picked from commit ca0ba8d)
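Why a transpose makes a tensor non-contiguous, and why a cont-style copy fixes it, can be sketched with shape/stride bookkeeping. This is a generic row-major illustration with hypothetical helper names, not ggml's own layout code (ggml orders dimensions differently), and the example shape is made up.

```python
def row_major_strides(shape):
    """Element strides of a densely packed row-major tensor."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

def is_contiguous(shape, strides):
    """A view is contiguous when its strides match the dense layout."""
    return tuple(strides) == row_major_strides(shape)

def transpose_2d(shape, strides):
    """Swap the two axes of a 2D view without moving any data."""
    return (shape[1], shape[0]), (strides[1], strides[0])

def cont(shape, strides):
    """Model a materializing copy: same shape, dense strides."""
    return shape, row_major_strides(shape)

shape = (4, 6)                          # hypothetical example shape
strides = row_major_strides(shape)      # (6, 1)
t_shape, t_strides = transpose_2d(shape, strides)
c_shape, c_strides = cont(t_shape, t_strides)
```

The transposed view keeps the original memory but walks it with swapped strides, so a kernel that requires dense inputs cannot use it directly; the copy restores dense strides at the cost of one extra node.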

…n FA bucket

build_target_step rebuilds the 2744-node cgraph every decode step, costing ~0.38ms/token and resetting ggml-cuda's CUDA-graph warmup counter. The only per-step topology change is win_len_padded = round_up(committed+1, 256), which only advances every 256 steps. Gate the rebuild on committed/256; within a bucket reuse the cached graph and update only mutable inputs. DFLASH_AR_NO_REUSE=1 restores the per-step rebuild.

Measured: +6% decode tok/s (106→113) on Q3_K_XL, q4_0 KV, 16K ctx. (cherry picked from commit dbf4ed9)
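The bucket gating described above can be sketched as follows. This is a simplified model, not the server implementation: the GraphCache class and its counter are hypothetical stand-ins for the cached cgraph and the call to build_target_step, and only the round_up(committed+1, 256) and committed/256 expressions come from the commit message.

```python
FA_BUCKET = 256  # bucket width quoted in the commit message

def round_up(n: int, multiple: int) -> int:
    """Round n up to the next multiple."""
    return ((n + multiple - 1) // multiple) * multiple

class GraphCache:
    """Hypothetical cache that rebuilds only when the FA bucket changes."""

    def __init__(self):
        self.bucket = None
        self.rebuilds = 0

    def step(self, committed: int) -> int:
        bucket = committed // FA_BUCKET
        if bucket != self.bucket:
            # Stands in for the full graph rebuild.
            self.bucket = bucket
            self.rebuilds += 1
        # Inside a bucket only mutable inputs would be refreshed here.
        return round_up(committed + 1, FA_BUCKET)

cache = GraphCache()
win_lens = [cache.step(c) for c in range(1024)]
```

Over 1024 decode steps this model rebuilds 4 times instead of 1024, and the padded window length is constant inside each bucket, which is the property that makes reuse safe in this sketch.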

Move ggml_cont from the concat call to the qkv_T transpose definition. This makes qkv_T contiguous once (1 cont per DeltaNet layer) instead of wrapping both concat inputs (2 conts per layer). conv_states_r is already contiguous (reshape of contiguous cache tensor). Net: -30 graph nodes (2704→2674), compute -24µs/step. (cherry picked from commit e3db53f)

dusterbloom added 6 commits July 1, 2026 09:29

perf(qwen35): keep AR replay micro-opts submodule-free

38e0cf7

dusterbloom wants to merge 6 commits into Luce-Org:main from dusterbloom:perf/qwen35-ar-replay-microopts
