perf(qwen35): use in-place GDN state write in pure AR#473
Draft
dusterbloom wants to merge 7 commits into
Draft
Conversation
In build_delta_net_block, replace the two-op silu(z_4d) + mul(output, z_silu) pattern with ggml_swiglu_split(z_4d, output_n). swiglu_split(a,b)=silu(a)*b, mathematically identical. Removes 30 kernel launches per AR decode step (UNARY-30, MUL-30, GLU+30; graph 2744->2714). Token-identical at temp=0 confirmed. Part 1 of launch-count gap closure (30/162 = 18.5% of gap).
…re AR FWHT gate: DFLASH_NO_WHT=1 disables the K/Q rotation that costs ~16 extra kernels/token (~3% decode). q4_0 KV doesn't need the outlier spreading on Ampere. Default off for TQ3_0 (WHT baked into quant). Repeat_back skip: in pure AR (no tree/capture), the SSM kernel's broadcast handles the head mismatch natively — skip the repeat_back copy per DeltaNet layer per step (~1% decode).
The DeltaNet conv_input concat (conv_states_r, qkv_T) hit the slow concat_f32_dim0 kernel (15us) because both inputs were non-contiguous (reshape + transpose). Wrapping in ggml_cont routes to the fast concat_cont path (3.3us) — ~0.5ms/token saved in the 30-layer serial DeltaNet recurrence.
…n FA bucket build_target_step rebuilds the 2744-node cgraph every decode step, costing ~0.38ms/token and resetting ggml-cuda's CUDA-graph warmup counter. The only per-step topology change is win_len_padded = round_up(committed+1, 256), which only advances every 256 steps. Gate the rebuild on committed/256; within a bucket reuse the cached graph and update only mutable inputs. DFLASH_AR_NO_REUSE=1 restores the per-step rebuild. Measured: +6% decode tok/s (106→113) on Q3_K_XL, q4_0 KV, 16K ctx.
Move ggml_cont from the concat call to the qkv_T transpose definition. This makes qkv_T contiguous once (1 cont per DeltaNet layer) instead of wrapping both concat inputs (2 conts per layer). conv_states_r is already contiguous (reshape of contiguous cache tensor). Net: -30 graph nodes (2704→2674), compute -24µs/step.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Use the ggml in-place GDN final-state write from Lucebox ggml PR24 in the Qwen35 pure-AR path.
This keeps the call site strict: pure AR only, no chunked path, no tree parent ids, no capture or persist intermediate path, and it keeps main behavior from #465.
Dependency
Depends on Lucebox ggml PR24:
Luce-Org/lucebox-ggml#24
This PR intentionally remains draft until that ggml submodule commit is available from the upstream submodule remote.
Validation
cmake --build server/build-pr-gdn-inplace --target test_server_unit -j 8server/build-pr-gdn-inplace/test_server_unit2022 assertions, 0 failures