
[FEAT] [ROCm] [V1]: Add AITER biased group topk for DeepSeekV3 #17955


Merged: 6 commits from aiter-biased-group-topk merged into vllm-project:main on May 14, 2025

Conversation

@vllmellm (Contributor) commented May 11, 2025

This PR adds the use of the AITER biased_grouped_topk kernel for DeepSeekV3.
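
For context, a minimal eager-mode sketch of the biased grouped-topk routing that the AITER kernel accelerates; this follows the semantics of vLLM's grouped_topk with an e_score_correction_bias (the function name and masking details here are illustrative, not the kernel's actual implementation):

```python
import torch

def biased_grouped_topk_ref(gating_output: torch.Tensor,
                            e_score_correction_bias: torch.Tensor,
                            topk: int, num_expert_group: int,
                            topk_group: int, renormalize: bool = True):
    """Eager reference: bias-corrected group selection, then per-token top-k."""
    num_token = gating_output.shape[0]
    scores = gating_output.sigmoid()  # DeepSeekV3 uses sigmoid scoring
    biased = scores + e_score_correction_bias.unsqueeze(0)
    # Rank expert groups by the sum of their two highest biased scores.
    group_scores = (biased.view(num_token, num_expert_group, -1)
                    .topk(2, dim=-1).values.sum(dim=-1))
    group_idx = group_scores.topk(topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1)
    score_mask = (group_mask.unsqueeze(-1)
                  .expand(num_token, num_expert_group,
                          biased.shape[-1] // num_expert_group)
                  .reshape(num_token, -1).bool())
    # Top-k experts come from the surviving groups; the routing weights use
    # the unbiased scores.
    topk_ids = (biased.masked_fill(~score_mask, float("-inf"))
                .topk(topk, dim=-1).indices)
    topk_weights = scores.gather(1, topk_ids)
    if renormalize:
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids
```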

Performance Comparison Summary

| Metric | V0 Before | V0 After | Change (%) | V1 Before | V1 After | Change (%) |
|---|---:|---:|---:|---:|---:|---:|
| Benchmark Duration (s) | 168.90 | 163.00 | -3.5% | 144.65 | 137.63 | -4.9% |
| Request Throughput (req/s) | 2.96 | 3.07 | +3.7% | 3.46 | 3.63 | +4.9% |
| Output Token Throughput (tok/s) | 836.47 | 951.74 | +13.8% | 954.81 | 918.66 | -3.8% |
| Total Token Throughput (tok/s) | 3757.01 | 3977.94 | +5.9% | 4365.06 | 4502.65 | +3.2% |
| Mean TTFT (ms) | 23508.33 | 21869.51 | -7.0% | 24366.97 | 22886.85 | -6.1% |
| Median TTFT (ms) | 24786.55 | 21003.01 | -15.3% | 24778.90 | 23169.19 | -6.5% |
| P99 TTFT (ms) | 40458.17 | 37903.93 | -6.3% | 43829.83 | 42134.00 | -3.9% |
| Mean ITL (ms) | 111.66 | 104.64 | -6.3% | 86.47 | 86.98 | +0.6% |
| Median ITL (ms) | 77.75 | 75.21 | -3.3% | 57.22 | 54.72 | -4.4% |
| P99 ITL (ms) | 101.53 | 87.36 | -14.0% | 2635.71 | 2577.23 | -2.2% |

Key Observations:

  • Both V0 and V1 improve on most metrics after the change
  • V0 shows a significant improvement in output token throughput (+13.8%), while V1 regresses slightly there (-3.8%)
  • V1 has better overall throughput than V0 both before and after the change
  • Time to First Token (TTFT) improved in both versions, with V0 showing the larger median-TTFT improvement (-15.3%)
  • Inter-token Latency (ITL) improved on most metrics in both versions

Accuracy Validation

  • AITER V0

vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=32768,block_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---:|---:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9447 | ± 0.0063 |
|  |  | strict-match | 5 | exact_match | 0.9447 | ± 0.0063 |
  • AITER V1

vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=32768,block_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---:|---:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9484 | ± 0.0061 |
|  |  | strict-match | 5 | exact_match | 0.9462 | ± 0.0062 |
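
The headers above match lm-evaluation-harness output. A hedged sketch of how such a run could be reproduced through the harness's Python API, assuming lm_eval is installed with the vLLM backend (the exact invocation used for this PR is not shown in the thread):

```python
import lm_eval

# Mirror the model_args string reported above; GSM8K with 5-shot prompting.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=("pretrained=deepseek-ai/DeepSeek-V3,"
                "tensor_parallel_size=8,max_model_len=32768,"
                "block_size=1,trust_remote_code=True"),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```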

V1 custom op registration unit tests

The unit tests check the following criteria (a sketch of such a test follows the list):

  • the custom ops are registered successfully
  • the fake tensor definitions of the custom ops are correct
  • the outputs generated in non-torch.compile and torch.compile modes are the same
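
A minimal sketch of what such a test can look like, assuming the op is registered under the torch.ops.vllm namespace as in the review discussion below; the op's argument list here is hypothetical:

```python
import torch

def test_biased_grouped_topk_custom_op():
    # 1. Registration: the op resolves under the vllm namespace.
    assert hasattr(torch.ops.vllm, "rocm_aiter_biased_grouped_topk")
    op = torch.ops.vllm.rocm_aiter_biased_grouped_topk

    def routing(gating_output, bias):
        # Hypothetical argument order: (gating, bias, topk, groups, topk_group).
        return op(gating_output, bias, 8, 8, 4)

    gating = torch.randn(16, 256, device="cuda")
    bias = torch.randn(256, device="cuda")

    # 2. Fake-tensor definition: torch.compile can only trace the op if its
    #    fake (meta) implementation returns correctly shaped tensors.
    compiled = torch.compile(routing, fullgraph=True)

    # 3. Eager and compiled outputs must match.
    eager_out = routing(gating, bias)
    compiled_out = compiled(gating, bias)
    for eager, comp in zip(eager_out, compiled_out):
        torch.testing.assert_close(eager, comp)
```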


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@vllmellm changed the title from [FEAT] [ROCm]: Add AITER biased group topk for DeepSeekV3 to [FEAT] [ROCm] [V1]: Add AITER biased group topk for DeepSeekV3 on May 11, 2025
@houseroad added the rocm label May 12, 2025
@houseroad (Collaborator) commented

Fix the linter (maybe need rebase)?

tjtanaa added a commit: Signed-off-by: tjtanaa <[email protected]>
@tjtanaa (Contributor) commented May 13, 2025

> Fix the linter (maybe need rebase)?

@houseroad It has been fixed.

Review comment on this snippet:

```python
topk_group=topk_group,
scoring_func=scoring_func,
e_score_correction_bias=e_score_correction_bias)
if (rocm_aiter_biased_grouped_topk
```

A reviewer (Contributor) commented:

Can you add a rocm_aiter_biased_grouped_topk wrapper in rocm_aiter_fused_moe that has the same interface as grouped_topk? That way we can do something like:

```python
if is_rocm_aiter_moe_enabled():
    grouped_topk = torch.ops.vllm.rocm_aiter_biased_grouped_topk
else:
    from vllm.model_executor.layers.fused_moe.fused_moe import grouped_topk
```

At the top of the file? That would remove all of the dispatching logic here, AFAICT. It looks like the only dead argument you would have is scoring_func, which I guess is assumed by the AITER kernel? If so, asserting it's "sigmoid" inside the wrapper would be good.
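
For illustration, a hedged sketch of what the suggested wrapper could look like (the signature mirrors grouped_topk from vllm.model_executor.layers.fused_moe.fused_moe as described above; the custom op's actual argument order in the PR may differ):

```python
import torch

def rocm_aiter_biased_grouped_topk(
        hidden_states: torch.Tensor,
        gating_output: torch.Tensor,
        topk: int,
        renormalize: bool,
        num_expert_group: int = 0,
        topk_group: int = 0,
        scoring_func: str = "softmax",
        e_score_correction_bias: torch.Tensor | None = None):
    # The AITER kernel bakes in sigmoid scoring, so scoring_func is a dead
    # argument here; fail loudly if a caller expects anything else.
    assert scoring_func == "sigmoid", (
        "AITER biased grouped topk assumes sigmoid scoring")
    assert e_score_correction_bias is not None
    # Hypothetical dispatch to the registered custom op; the real op name and
    # argument order come from this PR's op registration.
    return torch.ops.vllm.rocm_aiter_biased_grouped_topk(
        gating_output, e_score_correction_bias,
        topk, num_expert_group, topk_group, renormalize)
```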

tjtanaa added 2 commits May 13, 2025 16:15
Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
mergify bot commented May 13, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @vllmellm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label May 13, 2025
mergify bot removed the needs-rebase label May 13, 2025
@DarkLight1337 added the ready label May 14, 2025
vllm-bot merged commit 2d912fb into vllm-project:main May 14, 2025
84 of 88 checks passed
aarnphm pushed a commit to aarnphm/vllm that referenced this pull request May 15, 2025
gshtras pushed a commit to ROCm/vllm that referenced this pull request May 15, 2025
@tjtanaa tjtanaa deleted the aiter-biased-group-topk branch May 16, 2025 16:28
huachenheli pushed a commit to huachenheli/vllm that referenced this pull request May 22, 2025
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Labels: ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)
6 participants