perf(qwen35): use in-place GDN state write in pure AR#473

Draft
dusterbloom wants to merge 7 commits into
Luce-Org:mainfrom
dusterbloom:perf/qwen35-gdn-inplace-consumer
@dusterbloom dusterbloom commented Jun 30, 2026

Collaborator

Summary

Use the ggml in-place GDN final-state write from Lucebox ggml PR24 in the Qwen35 pure-AR path.

This keeps the call site strict: pure AR only, no chunked path, no tree parent ids, no capture or persist intermediate path, and it keeps main behavior from #465.

Dependency

Depends on Lucebox ggml PR24:
Luce-Org/lucebox-ggml#24

This PR intentionally remains draft until that ggml submodule commit is available from the upstream submodule remote.

Validation

  • cmake --build server/build-pr-gdn-inplace --target test_server_unit -j 8
  • server/build-pr-gdn-inplace/test_server_unit
  • Result: 2022 assertions, 0 failures

In build_delta_net_block, replace the two-op silu(z_4d) + mul(output, z_silu)
pattern with ggml_swiglu_split(z_4d, output_n). swiglu_split(a, b) = silu(a) * b,
so the result is mathematically identical. Removes 30 kernel launches per AR
decode step (UNARY -30, MUL -30, GLU +30; graph 2744 -> 2714). Confirmed
token-identical at temp=0. Part 1 of closing the launch-count gap
(30/162 = 18.5% of the gap).
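
The equivalence claimed above can be checked with a minimal scalar sketch. This is not the ggml kernel: gate_two_ops and gate_fused are hypothetical helper names that operate on plain float vectors, and only the identity swiglu_split(a, b) = silu(a) * b is taken from the commit message.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU activation: x * sigmoid(x).
static float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Two-op form: materialize silu(z) into a temporary, then multiply by the output.
// Mirrors the silu(z_4d) + mul(output, z_silu) pattern (hypothetical helper).
std::vector<float> gate_two_ops(const std::vector<float>& z,
                                const std::vector<float>& out) {
    assert(z.size() == out.size());
    std::vector<float> z_silu(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) z_silu[i] = silu(z[i]);
    std::vector<float> r(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) r[i] = out[i] * z_silu[i];
    return r;
}

// Fused form: silu(a) * b in a single pass, with no intermediate buffer
// (hypothetical helper standing in for a fused GLU op).
std::vector<float> gate_fused(const std::vector<float>& z,
                              const std::vector<float>& out) {
    assert(z.size() == out.size());
    std::vector<float> r(z.size());
    for (std::size_t i = 0; i < z.size(); ++i) r[i] = silu(z[i]) * out[i];
    return r;
}
```

Both functions compute the same product per element; the fused form only drops the intermediate buffer and the second pass, which is where the saved launch per layer would come from.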
…re AR

FWHT gate: DFLASH_NO_WHT=1 disables the K/Q rotation that costs ~16 extra
kernels/token (~3% decode). q4_0 KV doesn't need the outlier spreading on
Ampere. Default off for TQ3_0 (WHT baked into quant).

Repeat_back skip: in pure AR (no tree/capture), the SSM kernel's broadcast
handles the head mismatch natively — skip the repeat_back copy per DeltaNet
layer per step (~1% decode).

The DeltaNet conv_input concat (conv_states_r, qkv_T) hit the slow
concat_f32_dim0 kernel (15 µs) because both inputs were non-contiguous
(reshape + transpose). Wrapping both inputs in ggml_cont routes the concat to
the fast concat_cont path (3.3 µs), saving ~0.5 ms/token across the 30-layer
serial DeltaNet recurrence.
…n FA bucket

build_target_step rebuilds the 2744-node cgraph every decode step, costing
~0.38ms/token and resetting ggml-cuda's CUDA-graph warmup counter. The only
per-step topology change is win_len_padded = round_up(committed+1, 256),
which only advances every 256 steps. Gate the rebuild on committed/256;
within a bucket reuse the cached graph and update only mutable inputs.

DFLASH_AR_NO_REUSE=1 restores the per-step rebuild.

Measured: +6% decode tok/s (106→113) on Q3_K_XL, q4_0 KV, 16K ctx.
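
The bucket gating described above can be sketched as follows. This is an illustration under stated assumptions, not the code in build_target_step: GraphCache, kBucket, and the no_reuse parameter are hypothetical names, with no_reuse standing in for DFLASH_AR_NO_REUSE=1. Only the formula win_len_padded = round_up(committed+1, 256) and the committed/256 bucket key come from the commit message.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::int64_t kBucket = 256;  // padding granularity from the commit message

// Round n up to the next multiple of m (m > 0).
std::int64_t round_up(std::int64_t n, std::int64_t m) {
    return (n + m - 1) / m * m;
}

// Padded attention window length for the current step.
std::int64_t win_len_padded(std::int64_t committed) {
    return round_up(committed + 1, kBucket);
}

// Hypothetical cache: remembers which bucket the current graph was built for.
struct GraphCache {
    std::int64_t bucket = -1;   // -1 means no graph has been built yet
    std::int64_t rebuilds = 0;  // number of full graph rebuilds so far

    // Returns true when the graph must be rebuilt for this step.
    // no_reuse models the per-step rebuild escape hatch.
    bool step(std::int64_t committed, bool no_reuse = false) {
        assert(committed >= 0);
        const std::int64_t b = committed / kBucket;
        if (no_reuse || b != bucket) {
            bucket = b;
            ++rebuilds;
            return true;
        }
        return false;  // same bucket: reuse the graph, refresh mutable inputs only
    }
};
```

Because win_len_padded equals (committed/256 + 1) * 256, the bucket index changes exactly when the padded window length does, so keying the cache on committed/256 never reuses a graph with a stale topology.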

Move ggml_cont from the concat call to the qkv_T transpose definition.
This makes qkv_T contiguous once (1 cont per DeltaNet layer) instead of
wrapping both concat inputs (2 conts per layer). conv_states_r is already
contiguous (reshape of a contiguous cache tensor). Net: -30 graph nodes
(2704→2674), compute -24 µs/step.