feat(server): add Qwen35MoE target layer-split adapter by weicj · Pull Request #475 · Luce-Org/lucebox-hub

weicj · 2026-07-01T01:55:51Z

Summary

This PR adds a qwen35moe target layer-split entry point for --target-layer-split, while keeping the existing Qwen35MoeBackend path unchanged.

The adapter splits by contiguous target layers: each shard runs its local layer span and transfers activations only at shard boundaries. Per-layer execution still uses the existing MoE-aware Qwen35 target graph.
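The contiguous split described above can be sketched as follows. This is a minimal illustration, not the adapter's actual code: `LayerSpan`, `split_layers`, `owner_of`, and `count_transfers` are hypothetical names, and the even split is an assumption (the PR's validation run happens to use `[0,20)` + `[20,40)`, but the real adapter may take its spans from configuration).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one shard: a device label plus the
// half-open layer range [begin, end) that it owns.
struct LayerSpan {
    std::string device;
    int begin;
    int end;
};

// Split n_layers into contiguous spans, one per device, as evenly as
// possible. Earlier devices receive the remainder layers.
// Assumption: an even split; real placement may be user-specified.
std::vector<LayerSpan> split_layers(int n_layers,
                                    const std::vector<std::string>& devices) {
    assert(n_layers >= 0 && !devices.empty());
    const int n = static_cast<int>(devices.size());
    std::vector<LayerSpan> spans;
    int begin = 0;
    for (int i = 0; i < n; ++i) {
        const int count = n_layers / n + (i < n_layers % n ? 1 : 0);
        spans.push_back({devices[i], begin, begin + count});
        begin += count;
    }
    return spans;
}

// Index of the shard that owns `layer`, or -1 if no span contains it.
int owner_of(const std::vector<LayerSpan>& spans, int layer) {
    for (std::size_t i = 0; i < spans.size(); ++i) {
        if (layer >= spans[i].begin && layer < spans[i].end) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Count activation hand-offs in one forward pass: one for each pair of
// consecutive layers that are owned by different shards.
int count_transfers(const std::vector<LayerSpan>& spans, int n_layers) {
    int transfers = 0;
    for (int layer = 1; layer < n_layers; ++layer) {
        if (owner_of(spans, layer) != owner_of(spans, layer - 1)) {
            ++transfers;
        }
    }
    return transfers;
}
```

With 40 layers and two devices, this sketch yields two spans and a single boundary, so activations cross backends once per forward pass, regardless of how many experts each layer routes to.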

This gives Qwen35MoE a coarse mixed-backend option for pure-GPU deployments where the full weights fit in VRAM, without hot/cold expert scheduling or per-expert IPC orchestration. Avoiding those costs is what lets this path deliver better measured performance than the remote expert-compute path in the same hardware class.

Changes

Add a qwen35moe layer-split branch in create_backend that selects Qwen35MoeLayerSplitAdapter.
Keep the dense Qwen35 layer-split adapter unchanged.
Add a Qwen35MoE-specific adapter boundary over the shared layer-split substrate.
Pass the same target path, device placement, remote target-shard IPC config, chunk size, and DFlash-related adapter config used by the existing Qwen35 split path.
Keep the existing Qwen35MoeBackend path unchanged when target layer split is not requested.
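The selection rule in the list above might look roughly like the sketch below. Everything here is hypothetical: the real `create_backend` signature, config types, and architecture strings are not shown in this PR description, so `BackendKind`, `BackendRequest`, `select_backend`, the `"qwen35"` string, and the existence of a non-split dense backend are assumptions made only to illustrate the branching.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical labels for the paths mentioned in the PR description.
enum class BackendKind {
    Qwen35,              // dense backend without layer split (assumed to exist)
    Qwen35LayerSplit,    // dense layer-split adapter, unchanged by this PR
    Qwen35Moe,           // existing Qwen35MoeBackend path, unchanged
    Qwen35MoeLayerSplit  // new Qwen35MoeLayerSplitAdapter
};

// Hypothetical, reduced view of what the factory needs to decide.
struct BackendRequest {
    std::string arch;         // e.g. "qwen35moe"; "qwen35" is an assumed name
    bool target_layer_split;  // true when --target-layer-split is requested
};

// Pick a path: the MoE layer-split adapter is chosen only when the
// architecture is qwen35moe AND a target layer split was requested.
BackendKind select_backend(const BackendRequest& req) {
    if (req.arch == "qwen35moe") {
        return req.target_layer_split ? BackendKind::Qwen35MoeLayerSplit
                                      : BackendKind::Qwen35Moe;
    }
    if (req.arch == "qwen35") {
        return req.target_layer_split ? BackendKind::Qwen35LayerSplit
                                      : BackendKind::Qwen35;
    }
    throw std::invalid_argument("unsupported arch: " + req.arch);
}
```

The point of the sketch is that the new branch is purely additive: requests that do not ask for a layer split resolve exactly as they did before.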

Validation

Remote runtime validation on lucebox3:

Hardware: RTX 3090 CUDA + Radeon 8060S Strix Halo HIP/gfx1151
Model: Qwen3.6-35B-A3B-Q4_K_M.gguf
Main request for the pure CUDA, pure HIP, and layer-split rows: 3581 prompt tokens / 128 completion tokens
Layer split: cuda:0 [0,20) + hip:0 [20,40)

Existing path comparison on the same remote machine:

| Path | Placement shape | Prefill | Decode | Wall |
| --- | --- | --- | --- | --- |
| Pure RTX 3090 CUDA all-hot | all experts on CUDA | `1669.5 tok/s` | `91.0 tok/s` | `3.6s` |
| Pure Strix Halo HIP all-hot | all experts on HIP | `887.2 tok/s` | `49.1 tok/s` | `6.7s` |
| Qwen35MoE layer split | CUDA `[0,20)` + HIP `[20,40)` | `1193.67 tok/s` | `53.0 tok/s` | `5.5s` |
| Qwen35MoE remote expert compute (#388) | CUDA parent + HIP remote expert daemon, auto/batched | `585.5 tok/s` | `46.8 tok/s` | `10.00s` |

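The clean layer-split rerun shows that Qwen35MoE now has a practical mixed-backend option for pure-GPU topologies where the full weights fit in VRAM. In these runs it lands between the two single-backend baselines: prefill is 1193.67 tok/s versus 887.2 tok/s for pure HIP and 1669.5 tok/s for pure CUDA, and decode is 53.0 tok/s versus 49.1 tok/s and 91.0 tok/s. It is also well ahead of the earlier remote expert-compute mixed path on prefill (585.5 tok/s) and modestly ahead on decode (46.8 tok/s), because this PR uses coarse layer ownership instead of per-expert IPC scheduling.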

cubic-dev-ai

3 issues found across 8 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen35moe/qwen35moe_layer_split_adapter.cpp">

<violation number="1" location="server/src/qwen35moe/qwen35moe_layer_split_adapter.cpp:36">
P3: This new adapter duplicates the existing qwen35 layer-split adapter instead of sharing the substrate. Factor common lifecycle/kvflash/snapshot/decode code into a shared base/helper so qwen35 and qwen35moe do not diverge.</violation>
</file>


cubic-dev-ai · 2026-07-01T03:11:26Z

@@ -0,0 +1,1439 @@
+// Qwen35MoE target layer-split adapter.


P3: This new adapter duplicates the existing qwen35 layer-split adapter instead of sharing the substrate. Factor common lifecycle/kvflash/snapshot/decode code into a shared base/helper so qwen35 and qwen35moe do not diverge.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_layer_split_adapter.cpp, line 36:

<comment>This new adapter duplicates the existing qwen35 layer-split adapter instead of sharing the substrate. Factor common lifecycle/kvflash/snapshot/decode code into a shared base/helper so qwen35 and qwen35moe do not diverge.</comment>

<file context>
@@ -0,0 +1,1439 @@
+
+Qwen35MoeLayerSplitAdapter::~Qwen35MoeLayerSplitAdapter() { shutdown(); }
+
+bool Qwen35MoeLayerSplitAdapter::init() {
+    if (cfg_.device.is_layer_split() && cfg_.remote_target_shard.enabled()) {
+        return init_mixed_target_split();
</file context>

This Qwen35MoE layer-split adapter was explicitly designed to be standalone from the dense Qwen35 adapter, because the two have different placement and optimization needs. Sharing a dense/MoE adapter base would couple two paths that we intentionally want to evolve separately.

The parent comment was wrong here: qwen35moe layer-split is intentionally a standalone adapter, so sharing a dense/MoE base would couple two paths we want to keep separate.

feat(server): add qwen35moe target layer-split adapter

c449fa4

weicj force-pushed the feat/qwen35moe-target-layer-split-adapter-main branch from 43b8025 to c449fa4 Compare July 1, 2026 02:58

weicj marked this pull request as ready for review July 1, 2026 03:03

cubic-dev-ai Bot reviewed Jul 1, 2026


fix(server): address qwen35moe layer-split review feedback

08805ad

weicj wants to merge 2 commits into Luce-Org:main from weicj:feat/qwen35moe-target-layer-split-adapter-main
