[MLX] Qwen3.5 MoE ondevice sampling by kiymetakdemir · Pull Request #20545 · pytorch/executorch

kiymetakdemir · 2026-06-26T16:26:50Z

Summary
Lets the MLX-exported Qwen3.5 MoE model sample the next token on-device instead of returning logits for host-side sampling. Sampling is opt-in at export (--sample); temperature, top_p, and seed are runtime inputs, and the runner increments the seed per token. Measured ~17% higher decode throughput.

Changes

export.py: the --sample flag wraps the model in SamplingHead so that forward(tokens, input_pos, temperature, top_p, seed) → int64 token, and records a use_sampling constant-method flag. Non-sample export is unchanged.
qwen35_moe_engine.cpp: reads use_sampling from metadata; when set, consumes the token id directly instead of calling logits_to_token, feeds the scalar inputs in prefill/decode, and manages the per-token seed schedule. top_k is still rejected; top_p/seed are rejected on non-sample models.
main.cpp: --top_p / --seed flags wired into SamplingConfig.
run.py: detects use_sampling, mirrors the same seed schedule for parity, and rejects top_p/seed on non-sample models (matching the C++ runner). Adds --top-p / --seed flags.
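
For readers unfamiliar with the three runtime inputs, here is a minimal host-side sketch of temperature plus top-p (nucleus) sampling in plain Python. It is an illustration under assumptions, not the SamplingHead implementation: the function name sample_token is hypothetical, and the on-device MLX random stream will not match Python's random module.

```python
import math
import random


def sample_token(logits, temperature, top_p, seed):
    """Pick one token id from raw logits (hypothetical reference sketch)."""
    if temperature <= 0.0:
        # Greedy decoding: deterministic regardless of seed.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalizes the kept weights internally.
    rng = random.Random(seed)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With a fixed seed the call is repeatable, and temperature 0 reduces to argmax, which corresponds to the reproducibility and determinism checks listed under Verification.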

Verification

Real model (C++ runner): coherent, constraint-following generation; same --seed reproducible; --temperature 0 deterministic.
+17% decode throughput vs host sampling; prefill unchanged.

pytorch-bot · 2026-06-26T16:26:55Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20545

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 Pending, 1 Unrelated Failure, 2 Unclassified Failures

As of commit 2ad867d with merge base 51729bb:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Build Aarch64 Linux Wheels / pytorch/executorch / build-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
/__w/executorch/executorch/pytorch/executorch/backends/apple/coreml/runtime/inmemoryfs/inmemory_filesystem.cpp:722:48: error: ‘inmemoryfs::InMemoryFileSystem::InMemoryNode::Kind’ has not been declared
Build Aarch64 Linux Wheels / pytorch/executorch / upload / upload-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Unable to download artifact(s): Artifact not found for name: pytorch_executorch__3.10_cpu_aarch64

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / test-qnn-testsuite-linux / test-backend-linux (qnn, models) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-26T16:27:39Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

metascroy · 2026-06-26T17:03:53Z

    model.forward = types.MethodType(_clean_forward, model)


+class _MLXSampleWrapper(nn.Module):


Can we just use SamplingHead directly? What does the wrapper do extra?

It is because temperature, top_p, and seed are keyword-only arguments, so the wrapper maps them to positional args. I could pass them through export's kwargs instead, but that would also turn dynamic_shapes into a structured args/kwargs dict rather than a tuple.

metascroy · 2026-06-26T17:04:02Z

+    if sample:
+        # forward(tokens, input_pos, temperature, top_p, seed) -> token id.
+        # Scalars are static (None in dynamic_shapes); only the seq dim is dynamic.
+        model = _MLXSampleWrapper(model)


Use SamplingHead directly?

metascroy · 2026-06-26T17:05:18Z

+    "with --sample (MLX on-device sampling).");
+DEFINE_int64(
+    seed,
+    0,


Can the default mean we generate the base seed by drawing a random int from the C++ library?

Added this.

metascroy · 2026-06-26T17:06:19Z

@claude review this code

claude · 2026-06-26T17:07:13Z

Claude finished @metascroy's task in 2m 7s

Code review: Qwen3.5 MoE on-device sampling

Read all four changed files plus backends/mlx/llm/sampling.py (SamplingHead). Overall this is clean and the seed schedule parity between the C++ runner and run.py checks out. Below are answers to the inline questions and a few findings.

Answering the reviewer questions

1. export.py:736 / :780 — "Can we just use SamplingHead directly?"

Not directly with the current positional-export setup. SamplingHead.forward makes temperature, top_k, top_p, seed keyword-only:

def forward(self, *args, temperature, top_k=None, top_p=1.0, seed=None, **kwargs):

The export builds a positional example_args tuple and a positional dynamic_shapes tuple:

example_args = (tokens, input_pos, temperature, top_p, seed)
dynamic_shapes = ({1: seq_dim}, {0: seq_dim}, None, None, None)

You can't feed keyword-only params positionally, so the C++/run.py callers (which pass 5 positional inputs) wouldn't line up with a bare SamplingHead. _MLXSampleWrapper exists purely to flatten the signature to positional forward(tokens, input_pos, temperature, top_p, seed). That's the only thing it adds — no extra logic.
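
The shape of that adapter can be sketched without any export machinery. The classes below are hypothetical stand-ins in plain Python (no torch, no SamplingHead); they only show why a keyword-only signature rejects five positional inputs and how a thin wrapper flattens it.

```python
class KeywordOnlyHead:
    """Stand-in for a head whose sampling parameters are keyword-only."""

    def forward(self, *args, temperature, top_p=1.0, seed=None):
        # Echo the inputs so the routing is easy to inspect.
        return {"args": args, "temperature": temperature,
                "top_p": top_p, "seed": seed}


class PositionalAdapter:
    """Exposes the same call with every input positional."""

    def __init__(self, head):
        self.head = head

    def forward(self, tokens, input_pos, temperature, top_p, seed):
        # Model inputs stay positional; sampling scalars become keywords.
        return self.head.forward(
            tokens, input_pos,
            temperature=temperature, top_p=top_p, seed=seed,
        )
```

Calling KeywordOnlyHead().forward with five positional values raises TypeError, because *args absorbs all of them and temperature is never bound; the adapter accepts the same five values in order.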

Two ways to drop the wrapper if you prefer:

Export SamplingHead directly via export(head, args=(tokens, input_pos), kwargs={"temperature":..., "top_p":..., "seed":...}, dynamic_shapes={...dict...}). This changes the exported graph's input order/naming (kwargs come after args), so the C++ inputs.push_back order and run.py would need to match — worth confirming MLX lowering preserves the expected positional order before going this route.
Make temperature/top_p/seed positional-or-keyword in SamplingHead itself. That's the cleanest if SamplingHead is the canonical entry point, but it touches shared MLX code used elsewhere.

Given those trade-offs, keeping the thin wrapper local to export.py is defensible. A one-line comment on _MLXSampleWrapper saying "positional adapter — SamplingHead's sampling params are keyword-only and export needs them positional" would make the intent obvious and pre-empt this exact question.

2. main.cpp:47 — "Can default mean we generate base seed by calling a random int from the C++ library?"

Yes, and it would arguably be better UX. Today --seed defaults to 0, so every run is byte-for-byte identical even at temperature>0 — surprising for "sampling." A common pattern: treat seed < 0 (or a dedicated sentinel) as "draw a fresh base seed," e.g.

DEFINE_int64(seed, -1, "Base RNG seed; <0 = random per run. Runner increments per token.");
...
uint64_t base_seed = FLAGS_seed >= 0
    ? static_cast<uint64_t>(FLAGS_seed)
    : std::random_device{}();

Keep an explicit non-negative --seed reproducible (your verification relies on that). Mirror the same sentinel in run.py for parity. One caveat: if you go random-by-default, log the chosen seed so a run can still be reproduced.
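
For the run.py side, the same sentinel could look roughly like the sketch below. The helper name resolve_base_seed and the choice of 32 random bits are assumptions made for illustration; they are not taken from the PR.

```python
import secrets


def resolve_base_seed(seed_flag):
    """Return (base_seed, was_random) for a CLI seed value.

    Negative values are a sentinel meaning "draw a fresh seed for this run".
    """
    if seed_flag >= 0:
        # Explicit seeds pass through unchanged so runs stay reproducible.
        return seed_flag, False
    # Assumption: 32 bits is enough entropy for a per-run base seed.
    return secrets.randbits(32), True
```

A caller would log the returned seed whenever was_random is true, so that a randomly seeded run can be replayed with an explicit --seed.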

Findings

Seed-schedule parity (verified, no bug). C++ prefill samples token 0 with seed_ == base, then seed_ += 1 so the first decode draws with base+1; each subsequent decode increments. run.py prefill samples with seed, then decode uses seed + len(generated_tokens) (== base+1 for the first decode since generated_tokens already holds the prefill token). These line up. Chunked prefill in C++ reuses seed_ across chunks but only the final chunk's token is kept, so it matches run.py's single-pass prefill. 👍
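
The parity argument can be checked with a small simulation. The two functions below are hypothetical models of the schedules as described in this review (a counter bumped after each sampled token versus base plus the number of tokens generated so far); they are not the runner code.

```python
def cpp_style_seeds(base, num_tokens):
    """Seeds used when a counter is bumped after every sampled token."""
    seeds, seed = [], base
    for _ in range(num_tokens):
        seeds.append(seed)
        seed += 1
    return seeds


def py_style_seeds(base, num_tokens):
    """Seeds derived from the number of tokens generated so far."""
    generated, seeds = [], []
    for _ in range(num_tokens):
        # Prefill sees an empty list, so it samples with the base seed.
        seeds.append(base + len(generated))
        generated.append(None)  # placeholder for the sampled token id
    return seeds
```

Both functions yield base, base+1, base+2, and so on, with the first entry being the prefill token.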

No validation of top_p range. Both runners and SamplingHead document top_p ∈ (0, 1] but nothing rejects 0, negative, or >1. top_p=0 in particular could produce a degenerate nucleus on-device. Consider a cheap guard in both main.cpp and run.py (and/or prefill_tokens/decode_one) alongside the existing temperature check. Low severity.
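
A guard of the kind suggested here could be as small as the following sketch for run.py. The function name validate_top_p and the error message are hypothetical.

```python
def validate_top_p(top_p):
    """Return top_p unchanged if it lies in (0, 1], else raise ValueError."""
    # The chained comparison is False for NaN, so NaN is rejected as well.
    if not (0.0 < top_p <= 1.0):
        raise ValueError(f"top_p must be in (0, 1], got {top_p}")
    return top_p
```

Running the check once at argument-parsing time keeps an invalid value from ever reaching the exported model.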

run.py use_sampling detection swallows all exceptions.

try:
    meta_method = program.load_method("use_sampling")
    ...
except Exception:
    pass

This is fine for an old .pte lacking the method, but it also silently hides a genuine load/execute failure on a sampling model (you'd fall back to logits-path and then crash on the 2-input forward with a confusing error). Not blocking, but narrowing the catch or logging at debug would aid debugging.
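
One possible shape for the narrower handling is sketched below. Everything in it is an assumption for illustration: MissingMethodError stands in for whatever error the runtime raises when a method is absent, and read_flag is a hypothetical callable that wraps the load-and-execute step.

```python
import logging

logger = logging.getLogger(__name__)


class MissingMethodError(Exception):
    """Hypothetical error raised when a .pte has no such method."""


def detect_use_sampling(read_flag):
    """Return True if the model reports on-device sampling.

    read_flag is a hypothetical zero-argument callable that loads and
    executes the use_sampling constant method and returns its value.
    """
    try:
        return bool(read_flag())
    except MissingMethodError:
        # Older export without the flag: fall back to the logits path.
        return False
    except Exception:
        # Unexpected failure: keep the fallback but leave a trace to debug.
        logger.debug("use_sampling lookup failed", exc_info=True)
        return False
```

The fallback behavior is unchanged; the only difference is that an unexpected failure now leaves a debug-level traceback instead of disappearing.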

Minor — noqa: C901 added to two functions. _export_mlx and run_inference/main picked up complexity suppressions. Reasonable given the branching, but if run_inference keeps growing, the sampling vs. logits split (_sampling_scalars is already factored out) could move into a small helper. Non-blocking.

Defaults consistency. main.cpp --top_p default 1.0 and run.py --top-p default 1.0 match; rejection messages match. Good parity across the two runners.

Summary

No correctness bugs found — the seed schedule, chunking interaction, and cross-runner parity are sound. The wrapper is justified by SamplingHead's keyword-only signature (worth a clarifying comment). The two suggestions worth acting on: (a) optional top_p range validation, and (b) consider a random-by-default --seed per the reviewer, logging the chosen value for reproducibility.

kiymetakdemir and others added 2 commits June 25, 2026 20:11

Add on-device sampling to the Qwen3.5 MoE MLX runner

69e188c

Merge branch 'pytorch:main' into qwen-moe-ondevice-sampling

63802c8

meta-cla Bot added the CLA Signed label Jun 26, 2026

kiymetakdemir marked this pull request as ready for review June 26, 2026 16:29

Merge branch 'main' into qwen-moe-ondevice-sampling

a4f2902

metascroy reviewed Jun 26, 2026

digantdesai changed the title from "Qwen3.5 MoE ondevice sampling" to "[MLX] Qwen3.5 MoE ondevice sampling" Jun 26, 2026

kiymetakdemir and others added 5 commits June 26, 2026 11:48

random base seed, top_p validation

a989a03

Merge branch 'main' into qwen-moe-ondevice-sampling

ce69c04

Merge branch 'main' into qwen-moe-ondevice-sampling

6e6c2d2

Qwen MLX runner: only randomize seed for sampling models

ce286df

Merge branch 'main' into qwen-moe-ondevice-sampling

2ad867d

kiymetakdemir wants to merge 8 commits into pytorch:main from kiymetakdemir:qwen-moe-ondevice-sampling
