Skip to content

fix(qwen35): route non-tree GDN capture through persist buffer (src[7])#469

Merged
davide221 merged 2 commits into
Luce-Org:mainfrom
dusterbloom:fix/gdn-persist-routing
Jul 2, 2026
Merged

fix(qwen35): route non-tree GDN capture through persist buffer (src[7])#469
davide221 merged 2 commits into
Luce-Org:mainfrom
dusterbloom:fix/gdn-persist-routing

Conversation

@dusterbloom

@dusterbloom dusterbloom commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Summary

Route Qwen35 non-tree GDN capture through the persistent intermediate buffer (src[7]) instead of copying from the GDN result tensor embedded intermediate region.

This preserves the current main/#465 pure-chain prefill optimization: ggml_gated_delta_net_set_skip_intermediate() is still applied only when can_skip_gdn_intermediate is true. Capture paths keep intermediates available through the explicit persist buffer.

Why

Fast rollback / chain verify needs per-token SSM intermediate states for rollback. Tree verify already had a persist path via _tree_persist; non-tree capture was still using the legacy result-region copy path. Routing non-tree capture through src[7] avoids allocator lifetime issues and keeps the rollback buffer populated directly.

The unsupported fallback now fails loudly if capture is requested with a non-F32/F16 intermediate buffer, rather than leaving rollback state stale.

Relationship

Validation

  • env CCACHE_DIR=/tmp/lucebox-ccache CCACHE_TEMPDIR=/tmp/lucebox-ccache-tmp cmake --build server/build --target test_server_unit -j 8
  • server/build/test_server_unit
  • Result: 2022 assertions, 0 failures

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/src/qwen35/qwen35_target_graph.cpp Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/src/qwen35/qwen35_target_graph.cpp
@dusterbloom dusterbloom force-pushed the fix/gdn-persist-routing branch from a545f03 to ac34227 Compare June 30, 2026 21:08
Fast rollback / chain verify captures per-token SSM intermediate states for spec-decode rollback. Tree verify already wrote those states directly to a persistent buffer via src[7], while non-tree capture still copied from the GDN result tensor's embedded intermediate region.

Route non-tree capture through the same persist mechanism: build the normal non-tree GDN op, attach cap->ssm_intermediate_states as result->src[7], and let the kernel write intermediates directly to external cache storage. This avoids allocator lifetime issues and removes the redundant result-region copy for the configured F16 capture path.

The rebase keeps current main's can_skip_gdn_intermediate behavior intact, so pure chain prefill can still skip unused intermediates when no capture is requested.
With non-tree capture routed through src[7], the configured F16 intermediate buffer no longer needs the legacy result-region copy path. If a future caller requests capture with a type the persist path cannot handle, abort explicitly instead of leaving rollback state stale.

This keeps fast-rollback failures obvious and prevents silent state corruption.
@davide221

Copy link
Copy Markdown
Contributor

Validated on lucebox2 (RTX 3090 CUDA, stability profile, Qwen3.6-27B-Q4_K_M + dflash-draft-3.6-q8_0, greedy):

  • Dense AR and spec-decode outputs byte-identical to main (6/6 hashes), spec accept stats identical (326/1440, avg_commit 4.27) -> the src[7] persist-buffer capture routing is behaviorally equivalent on the chain path
  • The red 'GPU tests (gfx1151/ROCm)' check is an infra flake, not this PR: that job only runs rocminfo + a HIP vector-add (never builds PR code), and its 14-min failure is the known lucebox3 KFD-wedge signature. Re-run once the runner is rebooted.
  • LGTM to merge from the CUDA side.

@davide221

Copy link
Copy Markdown
Contributor
Screenshot 2026-07-02 alle 18 35 06

@davide221 davide221 merged commit 13ac209 into Luce-Org:main Jul 2, 2026
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants