perf(qwen35): recover dense AR replay fast path#476

Draft
dusterbloom wants to merge 6 commits into Luce-Org:main from dusterbloom:perf/qwen35-ar-replay-microopts

dusterbloom (Collaborator) commented Jul 1, 2026

Summary

Extract the server-only dense AR replay micro-optimizations from the replay-win branch into a focused PR that does not depend on ggml PR24 or any submodule bump.

This keeps the in-place GDN/submodule work isolated in #473, while letting reviewers evaluate the independent Qwen35 AR decode fast path now.

What changed

  • Fuse DeltaNet gated output with ggml_swiglu_split.
  • Skip graph-level FWHT K rotation for q4_0 by default, while preserving explicit overrides:
    • DFLASH_FORCE_WHT=1 restores the old graph-level rotation.
    • DFLASH_NO_WHT disables it.
  • Skip Q/K repeat-back in pure AR where the shape is already single-token decode.
  • Use contiguous concat inputs / materialized qkv_T to reduce concat-cont overhead.
  • Reuse the Qwen35 AR decode graph inside the same 256-token FA bucket so CUDA graph warmup/replay is not reset every token.
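
The FWHT override behaviour in the list above can be sketched as a small decision function. This is an illustration only, not the PR's code: `wht_enabled`, `wht_enabled_from_env`, and the `default_on` parameter are hypothetical names, and the precedence of `DFLASH_FORCE_WHT` over `DFLASH_NO_WHT` when both are set is an assumption. Only the two environment variable names come from the description.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: decide whether the graph-level FWHT rotation is built.
// default_on is what the build would do with no override for the current KV
// type (per this PR: false for q4_0).
// Assumptions: DFLASH_FORCE_WHT must equal "1" to take effect, DFLASH_NO_WHT
// takes effect when set to any value, and FORCE wins if both are present.
inline bool wht_enabled(bool default_on, const char* force_wht, const char* no_wht) {
    if (force_wht != nullptr && std::strcmp(force_wht, "1") == 0) {
        return true;   // explicit opt-in restores the old graph-level rotation
    }
    if (no_wht != nullptr) {
        return false;  // explicit opt-out
    }
    return default_on; // no override: follow the per-KV-type default
}

// Thin wrapper that reads the real environment.
inline bool wht_enabled_from_env(bool default_on) {
    return wht_enabled(default_on,
                       std::getenv("DFLASH_FORCE_WHT"),
                       std::getenv("DFLASH_NO_WHT"));
}
```

Keeping the decision in a pure function makes the override matrix easy to unit test without mutating the process environment.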

Scope and dependencies

Why

Our best 6-turn dense replay result recovered a decode-rate win over llama.cpp, but #473 mixed those micro-opts with the ggml-dependent in-place GDN consumer. Splitting this lets the low-risk server-side perf work land independently and keeps the reviewer surface small.

Latest measured replay context from the winning stack:

  • llama.cpp slot-cache: prefill about 6.900s, decode 5.394s, wall 12.294s, weighted decode 42.642 tok/s.
  • Luce patched/final stack: prefill about 8.1-8.7s, decode about 5.28-5.40s, wall about 13.6-14.1s, weighted decode about 43.6 tok/s.

This PR carries the server-only part of that decode recovery. Prefill parity remains separate work.

Validation

Local focused worktree, upstream ggml submodule a317b0ea0fc9eb716a311976fed8dc0f301dc09f:

  • env CC=/usr/bin/gcc-11 CXX=/usr/bin/g++-11 CUDACXX=/usr/local/cuda-12.6/bin/nvcc LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:/usr/lib/wsl/lib cmake -S server -B server/build-ar-replay-microopts-gcc11 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11 -DGGML_CCACHE=OFF
  • env LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:/usr/lib/wsl/lib cmake --build server/build-ar-replay-microopts-gcc11 --target dflash_server test_server_unit -j 8
  • env LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:/usr/lib/wsl/lib server/build-ar-replay-microopts-gcc11/test_server_unit
  • Result: 2022 assertions, 0 failures

Notes:

  • A first configure with /usr/bin/nvcc 12.0 + GCC 13 failed at CUDA compiler identification due to the known glibc _Float* incompatibility; CUDA 12.6 + GCC 11 configured and built successfully.
  • The build disables optional BSA because that optional submodule was not initialized in this focused worktree.

In build_delta_net_block, replace the two-op silu(z_4d) + mul(output, z_silu)
pattern with ggml_swiglu_split(z_4d, output_n). swiglu_split(a,b)=silu(a)*b,
mathematically identical. Removes 30 kernel launches per AR decode step
(UNARY-30, MUL-30, GLU+30; graph 2744->2714). Token-identical at temp=0
confirmed. Part 1 of launch-count gap closure (30/162 = 18.5% of gap).

(cherry picked from commit cfc9ff7)
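
The equivalence claimed in the commit message above can be checked on plain arrays. The sketch below does not call ggml. `gate_two_pass` and `gate_fused` are hypothetical stand-ins for the two-op pattern and the fused `silu(a)*b` form, and both assume equally sized inputs.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Two-op pattern: materialise silu(z) first, then multiply by the output.
// On a GPU this corresponds to two kernel launches and one temporary.
inline std::vector<float> gate_two_pass(const std::vector<float>& z,
                                        const std::vector<float>& out) {
    std::vector<float> z_silu(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) z_silu[i] = silu(z[i]);
    std::vector<float> y(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) y[i] = out[i] * z_silu[i];
    return y;
}

// Fused pattern: one pass computing silu(a) * b per element, no temporary.
inline std::vector<float> gate_fused(const std::vector<float>& a,
                                     const std::vector<float>& b) {
    std::vector<float> y(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) y[i] = silu(a[i]) * b[i];
    return y;
}
```

Both functions perform the same per-element arithmetic, so the fusion changes only the number of passes over the data, which matches the reported one-for-two node swap per DeltaNet layer.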
…re AR

FWHT gate: DFLASH_NO_WHT=1 disables the K/Q rotation that costs ~16 extra
kernels/token (~3% decode). q4_0 KV doesn't need the outlier spreading on
Ampere. Default off for TQ3_0 (WHT baked into quant).

Repeat_back skip: in pure AR (no tree/capture), the SSM kernel's broadcast
handles the head mismatch natively — skip the repeat_back copy per DeltaNet
layer per step (~1% decode).

(cherry picked from commit 309b1ce)
The DeltaNet conv_input concat (conv_states_r, qkv_T) hit the slow
concat_f32_dim0 kernel (15us) because both inputs were non-contiguous
(reshape + transpose). Wrapping in ggml_cont routes to the fast
concat_cont path (3.3us) — ~0.5ms/token saved in the 30-layer serial
DeltaNet recurrence.

(cherry picked from commit ca0ba8d)
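
Why a transpose makes an input non-contiguous, and what a `cont`-style copy does about it, can be sketched with a generic strided 2-D view. `View2D`, `is_contiguous`, `transpose`, and `make_contiguous` are hypothetical names for this illustration and are not ggml's types or API. Strides here are counted in elements, with index 0 as the fastest-varying dimension.

```cpp
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal strided view: ne = extents, nb = strides in elements,
// dimension 0 is the innermost (fastest-varying) one.
struct View2D {
    const float* data;
    std::array<std::size_t, 2> ne;
    std::array<std::size_t, 2> nb;
    float at(std::size_t i0, std::size_t i1) const {
        return data[i0 * nb[0] + i1 * nb[1]];
    }
};

// Dense layout: unit stride along dim 0, rows packed back to back.
inline bool is_contiguous(const View2D& v) {
    return v.nb[0] == 1 && v.nb[1] == v.ne[0];
}

// A transpose only swaps extents and strides; no data moves,
// so the result is generally no longer dense.
inline View2D transpose(View2D v) {
    std::swap(v.ne[0], v.ne[1]);
    std::swap(v.nb[0], v.nb[1]);
    return v;
}

// Copy the view into dense storage once; the returned view points into
// `storage` and can be handed to code that requires packed inputs.
inline View2D make_contiguous(const View2D& v, std::vector<float>& storage) {
    storage.resize(v.ne[0] * v.ne[1]);
    for (std::size_t i1 = 0; i1 < v.ne[1]; ++i1)
        for (std::size_t i0 = 0; i0 < v.ne[0]; ++i0)
            storage[i1 * v.ne[0] + i0] = v.at(i0, i1);
    return View2D{storage.data(), v.ne, {1, v.ne[0]}};
}
```

The trade-off is one extra copy per transposed input in exchange for letting the downstream concat take its packed fast path.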
…n FA bucket

build_target_step rebuilds the 2744-node cgraph every decode step, costing
~0.38ms/token and resetting ggml-cuda's CUDA-graph warmup counter. The only
per-step topology change is win_len_padded = round_up(committed+1, 256),
which only advances every 256 steps. Gate the rebuild on committed/256;
within a bucket reuse the cached graph and update only mutable inputs.

DFLASH_AR_NO_REUSE=1 restores the per-step rebuild.

Measured: +6% decode tok/s (106→113) on Q3_K_XL, q4_0 KV, 16K ctx.

(cherry picked from commit dbf4ed9)
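
The bucket gating described in the commit message above relies on the identity round_up(committed+1, 256) = (committed/256 + 1) * 256 for non-negative integers, so the padded window length changes exactly when committed/256 changes. A minimal sketch of that gate follows. `ArGraphCache`, `count_rebuilds`, and `kFaBucket` are hypothetical names, and the real code would also refresh the graph's mutable inputs on every reused step.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t kFaBucket = 256;  // FA bucket size from the description

// Round n up to the next multiple of `multiple`.
inline std::int64_t round_up(std::int64_t n, std::int64_t multiple) {
    assert(multiple > 0 && n >= 0);
    return (n + multiple - 1) / multiple * multiple;
}

// Tracks which bucket the cached graph was built for.
struct ArGraphCache {
    std::int64_t bucket = -1;   // -1 means nothing is cached yet
    std::int64_t rebuilds = 0;  // number of full graph builds so far

    // Returns true when the graph must be rebuilt for this step,
    // false when the cached graph can be reused.
    bool step(std::int64_t committed) {
        const std::int64_t b = committed / kFaBucket;
        if (b == bucket) return false;  // same win_len_padded: reuse
        bucket = b;
        ++rebuilds;
        return true;
    }
};

// Count full rebuilds over `steps` consecutive decode steps.
inline std::int64_t count_rebuilds(std::int64_t first_committed, std::int64_t steps) {
    ArGraphCache cache;
    for (std::int64_t i = 0; i < steps; ++i) cache.step(first_committed + i);
    return cache.rebuilds;
}
```

Under this gate, decoding 1024 tokens from an empty context triggers four builds instead of 1024, which is what lets the CUDA-graph warmup counter survive between bucket boundaries.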
Move ggml_cont from the concat call to the qkv_T transpose definition.
This makes qkv_T contiguous once (1 cont per DeltaNet layer) instead of
wrapping both concat inputs (2 conts per layer). conv_states_r is already
contiguous (reshape of contiguous cache tensor). Net: -30 graph nodes
(2704→2674), compute -24µs/step.

(cherry picked from commit e3db53f)