feat(server): add Qwen35MoE target layer-split adapter #475
3 issues found across 8 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/qwen35moe/qwen35moe_layer_split_adapter.cpp">
<violation number="1" location="server/src/qwen35moe/qwen35moe_layer_split_adapter.cpp:36">
P3: This new adapter duplicates the existing qwen35 layer-split adapter instead of sharing the substrate. Factor common lifecycle/kvflash/snapshot/decode code into a shared base/helper so qwen35 and qwen35moe do not diverge.</violation>
</file>
P3: This new adapter duplicates the existing qwen35 layer-split adapter instead of sharing the substrate. Factor common lifecycle/kvflash/snapshot/decode code into a shared base/helper so qwen35 and qwen35moe do not diverge.
<file context>
@@ -0,0 +1,1439 @@
+
+Qwen35MoeLayerSplitAdapter::~Qwen35MoeLayerSplitAdapter() { shutdown(); }
+
+bool Qwen35MoeLayerSplitAdapter::init() {
+ if (cfg_.device.is_layer_split() && cfg_.remote_target_shard.enabled()) {
+ return init_mixed_target_split();
</file context>
The Qwen35MoE layer-split adapter was explicitly designed as a standalone adapter, separate from the Qwen35 (dense) one, because the two have different placement and optimization needs. Sharing a dense/MoE adapter base would couple two paths that we intentionally want to evolve separately.
The parent comment was wrong here: qwen35moe layer-split is intentionally a standalone adapter, so sharing a dense/MoE base would couple two paths we want to keep separate.
Summary
This PR adds a `qwen35moe` target layer-split entry point for `--target-layer-split`, while keeping the existing `Qwen35MoeBackend` path unchanged.
The adapter splits by contiguous target layers: each shard runs its local layer span and transfers activations only at shard boundaries. Per-layer execution still uses the existing MoE-aware Qwen35 target graph.
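As a rough illustration of contiguous layer ownership, the sketch below models each shard as a half-open layer range and counts the activation hand-offs between adjacent shards. `LayerSpan`, `spans_are_contiguous`, `owner_of`, and `boundary_transfers` are hypothetical names made up for this example; they are not taken from the adapter's code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one shard's ownership: a device label plus a
// half-open layer range [begin, end). Not the PR's actual type.
struct LayerSpan {
    std::string device;
    int begin;
    int end;
};

// True when the spans cover [0, n_layers) in order with no gap or overlap.
bool spans_are_contiguous(const std::vector<LayerSpan>& spans, int n_layers) {
    int next = 0;
    for (const LayerSpan& s : spans) {
        if (s.begin != next || s.end <= s.begin) return false;
        next = s.end;
    }
    return next == n_layers;
}

// Index of the shard that owns `layer`, or -1 if no shard does.
int owner_of(const std::vector<LayerSpan>& spans, int layer) {
    for (std::size_t i = 0; i < spans.size(); ++i) {
        if (layer >= spans[i].begin && layer < spans[i].end) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Activations cross devices only between adjacent shards, so a forward pass
// needs one hand-off per shard boundary.
int boundary_transfers(const std::vector<LayerSpan>& spans) {
    return spans.empty() ? 0 : static_cast<int>(spans.size()) - 1;
}
```

With the split used in the validation run, `{{"cuda:0", 0, 20}, {"hip:0", 20, 40}}`, layer 19 belongs to shard 0, layer 20 to shard 1, and there is a single boundary hand-off per forward pass.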
This gives Qwen35MoE a coarse mixed-backend option for pure-GPU deployments where the full weights fit in VRAM, without hot/cold expert scheduling or per-expert IPC orchestration. Avoiding those costs is what lets this path deliver better measured performance than the remote expert-compute path in the same hardware class.
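The entry-point behaviour described in this summary (layer split only when requested, existing backend otherwise) might look roughly like the sketch below. Everything here is an assumption for illustration: the `Backend` interface, the `BackendConfigSketch` fields, the `*Stub` classes, and `create_backend_sketch` are hypothetical stand-ins, since the real `create_backend` signature is not shown in this PR text.

```cpp
#include <memory>
#include <string>

// Hypothetical minimal backend interface for the sketch.
struct Backend {
    virtual ~Backend() = default;
    virtual std::string name() const = 0;
};

// Stand-in for the existing, unchanged MoE path.
struct Qwen35MoeBackendStub : Backend {
    std::string name() const override { return "Qwen35MoeBackend"; }
};

// Stand-in for the new layer-split adapter.
struct Qwen35MoeLayerSplitAdapterStub : Backend {
    std::string name() const override { return "Qwen35MoeLayerSplitAdapter"; }
};

// Hypothetical config: `target_layer_split` models --target-layer-split.
struct BackendConfigSketch {
    std::string arch;
    bool target_layer_split = false;
};

// Pick the layer-split adapter only when it was requested; otherwise fall
// through to the existing backend so that path stays untouched.
std::unique_ptr<Backend> create_backend_sketch(const BackendConfigSketch& cfg) {
    if (cfg.arch == "qwen35moe") {
        if (cfg.target_layer_split) {
            return std::make_unique<Qwen35MoeLayerSplitAdapterStub>();
        }
        return std::make_unique<Qwen35MoeBackendStub>();
    }
    return nullptr;  // other architectures are out of scope for this sketch
}
```

The point of the shape is that the new branch is purely additive: a config without the layer-split request takes exactly the same route as before.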
Changes
- Add a `qwen35moe` layer-split branch in `create_backend` that selects `Qwen35MoeLayerSplitAdapter`.
- Keep the existing `Qwen35MoeBackend` path unchanged when target layer split is not requested.
Validation
Remote runtime validation on lucebox3:
- `Qwen3.6-35B-A3B-Q4_K_M.gguf`
- `3581` prompt tokens / `128` completion tokens
- `cuda:0 [0,20)` + `hip:0 [20,40)`
Existing path comparison on the same remote machine:
| | 1669.5 tok/s | 91.0 tok/s | 3.6s |
| | 887.2 tok/s | 49.1 tok/s | 6.7s |
| [0,20) + HIP [20,40) | 1193.67 tok/s | 53.0 tok/s | 5.5s |
| | 585.5 tok/s | 46.8 tok/s | 10.00s |
The clean layer-split rerun shows that Qwen35MoE now has a practical mixed-backend option for pure-GPU topologies where the full weights fit in VRAM. It sits above pure HIP prefill while staying far ahead of the earlier remote expert-compute mixed path, because this PR uses coarse layer ownership instead of per-expert IPC scheduling.
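One way to sanity-check figures like these is to rebuild an elapsed time from the two throughputs. The sketch below assumes, and this is only an assumption because the column labels are not visible here, that the first `tok/s` value in each row is prefill throughput, the second is decode throughput, and the last value is elapsed seconds for the `3581`-token prompt and `128`-token completion. `estimated_seconds` is a hypothetical helper, not part of the PR.

```cpp
// Estimated elapsed time if a run were only prefill followed by decode,
// with no load, transfer, or scheduling overhead.
// ASSUMPTION: the throughputs are prefill and decode rates in tokens/second.
double estimated_seconds(double prompt_tokens, double prefill_tok_s,
                         double completion_tokens, double decode_tok_s) {
    return prompt_tokens / prefill_tok_s + completion_tokens / decode_tok_s;
}
```

Under that reading, `estimated_seconds(3581, 1193.67, 128, 53.0)` gives roughly 5.42 s against the reported 5.5s, and the `1669.5` / `91.0` and `887.2` / `49.1` rows give roughly 3.55 s and 6.64 s against 3.6s and 6.7s. The `585.5` / `46.8` row gives roughly 8.85 s against 10.00s, which would suggest about a second of time not explained by the two throughputs, if the column assumption holds.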