
Conversation

@chenaoxuan (Contributor) commented Dec 31, 2025

What this PR does / why we need it?

  1. MagicMTP (from the paper "Block Verification Accelerates Speculative Decoding") is introduced; it verifies the draft tokens as a block, taking the influence among multiple draft tokens into account and improving the acceptance rate without compromising accuracy (a minimal sketch of the acceptance rule follows below).
  2. Added Triton and PyTorch implementations, and added E2E test cases.
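
A minimal PyTorch-style sketch of the block-wise acceptance rule (illustrative only: it mirrors the cumulative-probability logic discussed in the review comments below, is not the PR's actual implementation, and omits recovery-token sampling; the function name and tensor layout are assumptions):

import torch

def block_accept_len(draft_probs: torch.Tensor,
                     target_probs: torch.Tensor,
                     uniform: torch.Tensor) -> int:
    """Number of accepted draft tokens for one request (illustrative sketch).

    draft_probs / target_probs: probabilities the draft and target models
    assigned to each of the K proposed tokens; uniform: K samples from U(0, 1).
    """
    pi = 1.0          # cumulative acceptance probability over the block
    accepted = 0
    for k in range(draft_probs.numel()):
        ratio = 0.0
        if draft_probs[k] > 0:
            ratio = (target_probs[k] / draft_probs[k]).item()
        pi = pi * min(ratio, 1.0)
        # Unlike token-by-token rejection sampling, a later position can
        # still pass the check, which accepts the whole prefix up to it.
        if pi >= uniform[k].item():
            accepted = k + 1
    return accepted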

Does this PR introduce any user-facing change?

MagicMTP will automatically take effect when the parameter "num_speculative_tokens" >= 3.
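
For illustration, enabling MTP-style speculative decoding with three draft tokens through vLLM's offline API might look like the following; the method name and model path are assumptions for this sketch, not values taken from this PR:

from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/DeepSeek-R1-w8a8",      # placeholder model path
    tensor_parallel_size=8,
    speculative_config={
        "method": "deepseek_mtp",          # assumed MTP method name
        "num_speculative_tokens": 3,       # >= 3, so MagicMTP takes effect
    },
)
outputs = llm.generate(["Hello"], SamplingParams(temperature=0.6, top_p=0.95))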

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the MagicMTP speculative decoding acceleration algorithm, adding both Triton and PyTorch implementations, along with end-to-end tests. The new algorithm is enabled when num_speculative_tokens is 3 or more. My review identified a critical logical error in the Triton kernel's implementation of the cumulative acceptance probability, which deviates from the reference PyTorch implementation and the MagicMTP paper, potentially leading to incorrect sampling results. Additionally, a critical typo was found in the new test file that would prevent it from running. I have provided code suggestions to address both of these critical issues.

device=DEVICE)
RECOVERED_TOKEN_IDS = torch.full((BATCH_SIZE,),
                                 MAX_SPEC_LEN + 1,
                                 detype=torch.int64,

critical

There is a typo in the keyword argument for torch.full. The argument detype should be dtype. This typo will raise a TypeError and cause the test to fail.

Suggested change
- detype=torch.int64,
+ dtype=torch.int64,

Comment on lines 425 to 430
pi = min(pi * target_prob / draft_prob, 1.0)
if draft_prob > 0 and pi >= uniform_prob:
    last_accepted_token_pos = pos
    rejected = False
else:
    rejected = True

critical

The logic for calculating the cumulative acceptance probability pi is incorrect and does not match the block verification algorithm from the MagicMTP paper, nor the provided PyTorch reference implementation. The current formula pi = min(pi * target_prob / draft_prob, 1.0) incorrectly inflates the probability when target_prob / draft_prob > 1. The correct formula should be pi = pi * min(target_prob / draft_prob, 1.0).

Furthermore, the current implementation has unsafe handling of division by zero when draft_prob is 0. This relies on floating-point inf behavior and breaks the cumulative product logic in subsequent iterations.

The suggested change corrects the formula and handles the division-by-zero case safely, ensuring the logic is correct and robust.

Suggested change
- pi = min(pi * target_prob / draft_prob, 1.0)
- if draft_prob > 0 and pi >= uniform_prob:
-     last_accepted_token_pos = pos
-     rejected = False
- else:
-     rejected = True
+ ratio = 0.0
+ if draft_prob > 0.0:
+     ratio = target_prob / draft_prob
+ pi = pi * tl.min(ratio, 1.0)
+ if pi >= uniform_prob:
+     last_accepted_token_pos = pos
+     rejected = False
+ else:
+     rejected = True
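
To make the difference concrete, a tiny standalone check (illustrative numbers, not taken from the PR):

pi, ratio = 0.6, 2.0           # illustrative: cumulative prob so far, target/draft at this position
print(min(pi * ratio, 1.0))    # 1.0 -> original kernel inflates the acceptance probability
print(pi * min(ratio, 1.0))    # 0.6 -> corrected rule keeps the cumulative bound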

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@jianzs (Collaborator) commented Dec 31, 2025

What performance improvement does this feature offer?

@chenaoxuan force-pushed the magicmtp-013 branch 2 times, most recently from b400a85 to b84342b on December 31, 2025 at 06:44
@github-actions (bot) commented Jan 5, 2026

This pull request has conflicts; please resolve them before we can evaluate the pull request.

@chenaoxuan force-pushed the magicmtp-013 branch 2 times, most recently from fedc6cd to 752d372 on January 6, 2026 at 06:53
@chenaoxuan force-pushed the magicmtp-013 branch 3 times, most recently from 19101b0 to 00e2410 on January 6, 2026 at 09:03
@chenaoxuan (Contributor, Author) commented Jan 7, 2026

Performance Improvement:

Environment:
CANNVERSION=CANN 8.5.0.B100
PTAVERSION=FrameworkPTAdapter 7.2.RC1.B130
VLLM_ASCEND_VERSION=e07938047e4b117258f8cb564a41c472d2b6e4ab
VLLM_VERSION=v0.13.0
MOONCAKE_VERSION=v0.3.7.post2
TORCH_VERSION=2.8.0

Parameter:
--max-num-seqs 32
--max-model-len 8192
--max-num-batched-tokens 4096
--data-parallel-size 2
--tensor-parallel-size 8

Data:
256 requests sampled from gsm8k
max_out_len = 1500
temperature = 0.6
top_p = 0.95
ignore_eos=True

Model:
DeepSeek-R1_w8a8 + MTP = 3

Improvement (relative changes summarized in the snippet below):

  • Triton:
    TPOT: 76.44 ms -> 73.12 ms
    Acceptance Rate: 46.94% -> 51.37%
  • Torch (only the Triton path in rejection_sampler.py disabled):
    TPOT: 76.87 ms -> 75.19 ms
    Acceptance Rate: 46.99% -> 51.17%
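
For quick reference, the relative changes implied by these figures (computed directly from the numbers above):

runs = {
    "Triton": {"tpot_ms": (76.44, 73.12), "accept_pct": (46.94, 51.37)},
    "Torch":  {"tpot_ms": (76.87, 75.19), "accept_pct": (46.99, 51.17)},
}
for name, r in runs.items():
    (base_t, new_t), (base_a, new_a) = r["tpot_ms"], r["accept_pct"]
    print(f"{name}: TPOT -{(1 - new_t / base_t) * 100:.1f}%, "
          f"acceptance +{new_a - base_a:.2f} pts")
# Triton: TPOT -4.3%, acceptance +4.43 pts
# Torch:  TPOT -2.2%, acceptance +4.18 pts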

@chenaoxuan closed this Jan 7, 2026
@chenaoxuan reopened this Jan 7, 2026
@chenaoxuan (Contributor, Author) commented Jan 7, 2026

What performance improvement does this feature offer?

MagicMTP increases the draft token acceptance rate by accepting more of the potential draft tokens.
With the DeepSeek-R1 model on the gsm8k dataset:
TPOT: 76.44 ms -> 73.12 ms
Acceptance Rate: 46.94% -> 51.37%

@yiz-liu added the ready and ready-for-test labels Jan 7, 2026
@chenaoxuan force-pushed the magicmtp-013 branch 4 times, most recently from bcefa40 to 47186cd on January 7, 2026 at 10:19
@wangxiyuan merged commit 8763953 into vllm-project:main Jan 8, 2026
16 checks passed
Rozwel-dx pushed a commit to Rozwel-dx/vllm-ascend that referenced this pull request (vllm-project#5542) on Jan 8, 2026.
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request on Jan 8, 2026.

Labels: module:ops, module:tests, ready, ready-for-test
