feat(server): add Qwen35MoE target layer-split adapter #475

Open
weicj wants to merge 2 commits into Luce-Org:main from weicj:feat/qwen35moe-target-layer-split-adapter-main

@weicj weicj commented Jul 1, 2026

Collaborator

Summary

This PR adds a qwen35moe target layer-split entry point for --target-layer-split, while keeping the existing Qwen35MoeBackend path unchanged.

The adapter splits by contiguous target layers: each shard runs its local layer span and transfers activations only at shard boundaries. Per-layer execution still uses the existing MoE-aware Qwen35 target graph.
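As a rough illustration of the contiguous split described above, the sketch below models each shard as a half-open layer span and counts the activation hand-offs per forward pass. `ShardSpan`, `spans_cover_model`, `owner_of_layer`, and `boundary_transfers` are hypothetical names made up for this example, not types or functions from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one shard: a device name plus a half-open
// layer span [begin, end). Illustrative only, not the PR's real types.
struct ShardSpan {
    std::string device;
    int begin;
    int end;
};

// True when every span is non-empty, each span starts where the previous
// one ended, and together they cover exactly [0, n_layers).
bool spans_cover_model(const std::vector<ShardSpan>& shards, int n_layers) {
    int next = 0;
    for (const ShardSpan& s : shards) {
        if (s.begin != next || s.end <= s.begin) return false;
        next = s.end;
    }
    return next == n_layers;
}

// Index of the shard that owns `layer`, or -1 if no shard does.
int owner_of_layer(const std::vector<ShardSpan>& shards, int layer) {
    for (std::size_t i = 0; i < shards.size(); ++i) {
        if (layer >= shards[i].begin && layer < shards[i].end) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Activation hand-offs in one forward pass: one per boundary between
// adjacent shards, so N shards need N - 1 transfers.
int boundary_transfers(const std::vector<ShardSpan>& shards) {
    assert(!shards.empty());
    return static_cast<int>(shards.size()) - 1;
}
```

With two shards, for example `{"cuda:0", 0, 20}` and `{"hip:0", 20, 40}`, layer 19 belongs to the first shard, layer 20 to the second, and a forward pass crosses a single boundary.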

This gives Qwen35MoE a coarse mixed-backend option for pure-GPU deployments where the full weights fit in VRAM, without hot/cold expert scheduling or per-expert IPC orchestration. Avoiding those costs is what lets this path deliver better measured performance than the remote expert-compute path in the same hardware class.

Changes

  • Add a qwen35moe layer-split branch in create_backend that selects Qwen35MoeLayerSplitAdapter.
  • Keep the dense Qwen35 layer-split adapter unchanged.
  • Add a Qwen35MoE-specific adapter boundary over the shared layer-split substrate.
  • Pass the same target path, device placement, remote target-shard IPC config, chunk size, and DFlash-related adapter config used by the existing Qwen35 split path.
  • Keep the existing Qwen35MoeBackend path unchanged when target layer split is not requested.
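The dispatch described in the list above could look roughly like the following. This is a minimal sketch under stated assumptions: `BackendConfig`, the `Backend` interface, the `name()` method, and `create_backend_sketch` are hypothetical stand-ins, and only the two class names `Qwen35MoeBackend` and `Qwen35MoeLayerSplitAdapter` are taken from this PR. The real `create_backend` signature and config plumbing are not shown here.

```cpp
#include <memory>
#include <string>

// Hypothetical config: only the two fields this sketch needs.
struct BackendConfig {
    std::string arch;         // e.g. "qwen35moe"
    bool target_layer_split;  // true when --target-layer-split is requested
};

// Hypothetical minimal backend interface.
struct Backend {
    virtual ~Backend() = default;
    virtual std::string name() const = 0;
};

// Stand-ins that reuse the class names from the PR; bodies are placeholders.
struct Qwen35MoeBackend : Backend {
    std::string name() const override { return "Qwen35MoeBackend"; }
};

struct Qwen35MoeLayerSplitAdapter : Backend {
    std::string name() const override { return "Qwen35MoeLayerSplitAdapter"; }
};

// Selects the layer-split adapter only when the split is requested;
// otherwise the existing Qwen35MoE backend path is used unchanged.
std::unique_ptr<Backend> create_backend_sketch(const BackendConfig& cfg) {
    if (cfg.arch == "qwen35moe") {
        if (cfg.target_layer_split) {
            return std::make_unique<Qwen35MoeLayerSplitAdapter>();
        }
        return std::make_unique<Qwen35MoeBackend>();
    }
    return nullptr;  // other architectures are out of scope for this sketch
}
```

The point of the shape is that the non-split branch is untouched: a request without target layer split still reaches `Qwen35MoeBackend` exactly as before.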

Validation

Remote runtime validation on lucebox3:

  • Hardware: RTX 3090 CUDA + Radeon 8060S Strix Halo HIP/gfx1151
  • Model: Qwen3.6-35B-A3B-Q4_K_M.gguf
  • Main request for the pure CUDA, pure HIP, and layer-split rows: 3581 prompt tokens / 128 completion tokens
  • Layer split: cuda:0 [0,20) + hip:0 [20,40)

Existing path comparison on the same remote machine:

| Path | Placement shape | Prefill | Decode | Wall |
| --- | --- | --- | --- | --- |
| Pure RTX 3090 CUDA all-hot | all experts on CUDA | 1669.5 tok/s | 91.0 tok/s | 3.6s |
| Pure Strix Halo HIP all-hot | all experts on HIP | 887.2 tok/s | 49.1 tok/s | 6.7s |
| Qwen35MoE layer split | CUDA [0,20) + HIP [20,40) | 1193.67 tok/s | 53.0 tok/s | 5.5s |
| Qwen35MoE remote expert compute (#388) | CUDA parent + HIP remote expert daemon, auto/batched | 585.5 tok/s | 46.8 tok/s | 10.00s |

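The clean layer-split rerun shows that Qwen35MoE now has a practical mixed-backend option for pure-GPU topologies where the full weights fit in VRAM. It lands between the two single-device rows: faster than pure HIP on both prefill (1193.67 vs 887.2 tok/s) and decode (53.0 vs 49.1 tok/s), but slower than pure CUDA. It also stays well ahead of the earlier remote expert-compute mixed path (585.5 tok/s prefill, 46.8 tok/s decode), because this PR uses coarse layer ownership instead of per-expert IPC scheduling.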

@weicj weicj force-pushed the feat/qwen35moe-target-layer-split-adapter-main branch from 43b8025 to c449fa4 Compare July 1, 2026 02:58
@weicj weicj marked this pull request as ready for review July 1, 2026 03:03

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

3 issues found across 8 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen35moe/qwen35moe_layer_split_adapter.cpp">

<violation number="1" location="server/src/qwen35moe/qwen35moe_layer_split_adapter.cpp:36">
P3: This new adapter duplicates the existing qwen35 layer-split adapter instead of sharing the substrate. Factor common lifecycle/kvflash/snapshot/decode code into a shared base/helper so qwen35 and qwen35moe do not diverge.</violation>
</file>

Comment thread server/src/common/backend_factory.cpp
@@ -0,0 +1,1439 @@
// Qwen35MoE target layer-split adapter.

Contributor

P3: This new adapter duplicates the existing qwen35 layer-split adapter instead of sharing the substrate. Factor common lifecycle/kvflash/snapshot/decode code into a shared base/helper so qwen35 and qwen35moe do not diverge.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_layer_split_adapter.cpp, line 36:

<comment>This new adapter duplicates the existing qwen35 layer-split adapter instead of sharing the substrate. Factor common lifecycle/kvflash/snapshot/decode code into a shared base/helper so qwen35 and qwen35moe do not diverge.</comment>

<file context>
@@ -0,0 +1,1439 @@
+
+Qwen35MoeLayerSplitAdapter::~Qwen35MoeLayerSplitAdapter() { shutdown(); }
+
+bool Qwen35MoeLayerSplitAdapter::init() {
+    if (cfg_.device.is_layer_split() && cfg_.remote_target_shard.enabled()) {
+        return init_mixed_target_split();
</file context>

@weicj weicj Jul 1, 2026

Collaborator Author

The Qwen35MoE layer-split adapter was deliberately designed as a standalone adapter, separate from the dense Qwen35 one, because the two have different placement and optimization needs. Sharing a dense/MoE adapter base would couple two paths that we intentionally want to evolve separately.

Contributor

The parent comment was wrong here: qwen35moe layer-split is intentionally a standalone adapter, so sharing a dense/MoE base would couple two paths we want to keep separate.

Comment thread server/src/qwen35/graph_builders.cpp Outdated