
Conversation

@chenaoxuan (Contributor) commented Dec 31, 2025

What this PR does / why we need it?

  1. MagicMTP (from the paper "Block Verification Accelerates Speculative Decoding") is introduced; it verifies the draft tokens as a block, taking the influence among multiple draft tokens into account and improving the acceptance rate without compromising accuracy (a minimal sketch of the acceptance rule follows below).
  2. Added Triton and PyTorch implementations, and added E2E test cases.
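
A minimal PyTorch-style sketch of the block-wise acceptance rule (illustrative only: it mirrors the cumulative-probability logic discussed in the review comments below, is not the PR's actual implementation, and omits recovery-token sampling; the function name and tensor layout are assumptions):

import torch

def block_accept_len(draft_probs: torch.Tensor,
                     target_probs: torch.Tensor,
                     uniform: torch.Tensor) -> int:
    """Number of accepted draft tokens for one request (illustrative sketch).

    draft_probs / target_probs: probabilities the draft and target models
    assigned to each of the K proposed tokens; uniform: K samples from U(0, 1).
    """
    pi = 1.0          # cumulative acceptance probability over the block
    accepted = 0
    for k in range(draft_probs.numel()):
        ratio = 0.0
        if draft_probs[k] > 0:
            ratio = (target_probs[k] / draft_probs[k]).item()
        pi = pi * min(ratio, 1.0)
        # Unlike token-by-token rejection sampling, a later position can
        # still pass the check, which accepts the whole prefix up to it.
        if pi >= uniform[k].item():
            accepted = k + 1
    return accepted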

Does this PR introduce any user-facing change?

MagicMTP will automatically take effect when the parameter "num_speculative_tokens" >= 3.
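
For illustration, enabling MTP-style speculative decoding with three draft tokens through vLLM's offline API might look like the following; the method name and model path are assumptions for this sketch, not values taken from this PR:

from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/DeepSeek-R1-w8a8",      # placeholder model path
    tensor_parallel_size=8,
    speculative_config={
        "method": "deepseek_mtp",          # assumed MTP method name
        "num_speculative_tokens": 3,       # >= 3, so MagicMTP takes effect
    },
)
outputs = llm.generate(["Hello"], SamplingParams(temperature=0.6, top_p=0.95))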

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the MagicMTP speculative decoding acceleration algorithm, adding both Triton and PyTorch implementations, along with end-to-end tests. The new algorithm is enabled when num_speculative_tokens is 3 or more. My review identified a critical logical error in the Triton kernel's implementation of the cumulative acceptance probability, which deviates from the reference PyTorch implementation and the MagicMTP paper, potentially leading to incorrect sampling results. Additionally, a critical typo was found in the new test file that would prevent it from running. I have provided code suggestions to address both of these critical issues.

device=DEVICE)
RECOVERED_TOKEN_IDS = torch.full((BATCH_SIZE,),
                                 MAX_SPEC_LEN + 1,
                                 detype=torch.int64,

critical

There is a typo in the keyword argument for torch.full. The argument detype should be dtype. This typo will raise a TypeError and cause the test to fail.

Suggested change
- detype=torch.int64,
+ dtype=torch.int64,

Comment on lines 425 to 430
pi = min(pi * target_prob / draft_prob, 1.0)
if draft_prob > 0 and pi >= uniform_prob:
    last_accepted_token_pos = pos
    rejected = False
else:
    rejected = True

critical

The logic for calculating the cumulative acceptance probability pi is incorrect and does not match the block verification algorithm from the MagicMTP paper, nor the provided PyTorch reference implementation. The current formula pi = min(pi * target_prob / draft_prob, 1.0) incorrectly inflates the probability when target_prob / draft_prob > 1. The correct formula should be pi = pi * min(target_prob / draft_prob, 1.0).

Furthermore, the current implementation has unsafe handling of division by zero when draft_prob is 0. This relies on floating-point inf behavior and breaks the cumulative product logic in subsequent iterations.

The suggested change corrects the formula and handles the division-by-zero case safely, ensuring the logic is correct and robust.

Suggested change
- pi = min(pi * target_prob / draft_prob, 1.0)
- if draft_prob > 0 and pi >= uniform_prob:
-     last_accepted_token_pos = pos
-     rejected = False
- else:
-     rejected = True
+ ratio = 0.0
+ if draft_prob > 0.0:
+     ratio = target_prob / draft_prob
+ pi = pi * tl.min(ratio, 1.0)
+ if pi >= uniform_prob:
+     last_accepted_token_pos = pos
+     rejected = False
+ else:
+     rejected = True
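
To make the difference concrete, a tiny standalone check (illustrative numbers, not taken from the PR):

pi, ratio = 0.6, 2.0           # illustrative: cumulative prob so far, target/draft at this position
print(min(pi * ratio, 1.0))    # 1.0 -> original kernel inflates the acceptance probability
print(pi * min(ratio, 1.0))    # 0.6 -> corrected rule keeps the cumulative bound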

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@jianzs (Collaborator) commented Dec 31, 2025

What performance improvement does this feature offer?

@chenaoxuan force-pushed the magicmtp-013 branch 2 times, most recently from b400a85 to b84342b on December 31, 2025 at 06:44
@github-actions (bot) commented Jan 5, 2026

This pull request has conflicts; please resolve them before we can evaluate the pull request.

@chenaoxuan force-pushed the magicmtp-013 branch 2 times, most recently from fedc6cd to 752d372 on January 6, 2026 at 06:53
@chenaoxuan force-pushed the magicmtp-013 branch 3 times, most recently from 19101b0 to 00e2410 on January 6, 2026 at 09:03
@chenaoxuan (Contributor, Author) commented Jan 7, 2026

Performance Improvement:

Environment:
CANNVERSION=CANN 8.5.0.B100
PTAVERSION=FrameworkPTAdapter 7.2.RC1.B130
VLLM_ASCEND_VERSION=e07938047e4b117258f8cb564a41c472d2b6e4ab
VLLM_VERSION=v0.13.0
MOONCAKE_VERSION=v0.3.7.post2
TORCH_VERSION=2.8.0

Parameter:
--max-num-seqs 32
--max-model-len 8192
--max-num-batched-tokens 4096
--data-parallel-size 2
--tensor-parallel-size 8

Data:
256 requests sampled from gsm8k
max_out_len = 1500
temperature = 0.6
top_p = 0.95
ignore_eos=True

Model:
DeepSeek-R1_w8a8 + MTP = 3

Improvement (relative changes summarized in the snippet below):

  • Triton:
    TPOT: 76.44 ms -> 73.12 ms
    Acceptance Rate: 46.94% -> 51.37%
  • Torch (only the Triton path in rejection_sampler.py disabled):
    TPOT: 76.87 ms -> 75.19 ms
    Acceptance Rate: 46.99% -> 51.17%
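
For quick reference, the relative changes implied by these figures (computed directly from the numbers above):

runs = {
    "Triton": {"tpot_ms": (76.44, 73.12), "accept_pct": (46.94, 51.37)},
    "Torch":  {"tpot_ms": (76.87, 75.19), "accept_pct": (46.99, 51.17)},
}
for name, r in runs.items():
    (base_t, new_t), (base_a, new_a) = r["tpot_ms"], r["accept_pct"]
    print(f"{name}: TPOT -{(1 - new_t / base_t) * 100:.1f}%, "
          f"acceptance +{new_a - base_a:.2f} pts")
# Triton: TPOT -4.3%, acceptance +4.43 pts
# Torch:  TPOT -2.2%, acceptance +4.18 pts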

@chenaoxuan closed this Jan 7, 2026
@chenaoxuan reopened this Jan 7, 2026
@chenaoxuan (Contributor, Author) commented Jan 7, 2026

What performance improvement does this feature offer?

MagicMTP increases the draft token acceptance rate by accepting more of the potential draft tokens.
With the DeepSeek-R1 model on the gsm8k dataset:
TPOT: 76.44 ms -> 73.12 ms
Acceptance Rate: 46.94% -> 51.37%

@yiz-liu added the ready and ready-for-test labels Jan 7, 2026
@chenaoxuan force-pushed the magicmtp-013 branch 4 times, most recently from bcefa40 to 47186cd on January 7, 2026 at 10:19
@wangxiyuan merged commit 8763953 into vllm-project:main Jan 8, 2026
16 checks passed
Rozwel-dx pushed a commit to Rozwel-dx/vllm-ascend that referenced this pull request (vllm-project#5542) on Jan 8, 2026.
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request on Jan 8, 2026.

Labels: module:ops, module:tests, ready, ready-for-test
