
Conversation

@yewentao256 yewentao256 commented Nov 18, 2025

Purpose

A PR based on #28832 (which should land first).

After FlashInfer's update, we found that batch-invariant support for FlashInferMLA is broken (see flashinfer-ai/flashinfer#2107). We don't want to fall back to a for loop, which would be very slow, so we simply disable FlashInferMLA for now. Update: even with a for loop, there can still be some diff in the logprobs.
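
A minimal sketch of the idea (hypothetical names, not the actual vLLM code): when batch-invariant execution is requested, backend selection simply skips FlashInferMLA until flashinfer-ai/flashinfer#2107 is resolved.

    # Hypothetical sketch -- the constant and function names are assumptions, not vLLM's real API.
    UNSUPPORTED_BATCH_INVARIANT_BACKENDS = {"FLASHINFER_MLA"}

    def select_mla_backend(candidates: list[str], batch_invariant: bool) -> str:
        """Return the first candidate backend usable in the requested mode."""
        for name in candidates:
            if batch_invariant and name in UNSUPPORTED_BATCH_INVARIANT_BACKENDS:
                # Banned for now: batch-invariant output is broken upstream.
                continue
            return name
        raise ValueError("no attention backend supports batch-invariant mode")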

There is also a bug in the test:

(EngineCore_DP0 pid=1348542)   File "/home/wentao/vllm-source/vllm/model_executor/models/qwen2.py", line 339, in <lambda>
(EngineCore_DP0 pid=1348542)     lambda prefix: decoder_layer_type(
(EngineCore_DP0 pid=1348542)                    ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1348542)   File "/home/wentao/vllm-source/vllm/model_executor/models/qwen3.py", line 185, in __init__
(EngineCore_DP0 pid=1348542)     self.self_attn = Qwen3Attention(
(EngineCore_DP0 pid=1348542)                      ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1348542)   File "/home/wentao/vllm-source/vllm/model_executor/models/qwen3.py", line 120, in __init__
(EngineCore_DP0 pid=1348542)     self.attn = Attention(
(EngineCore_DP0 pid=1348542)                 ^^^^^^^^^^
(EngineCore_DP0 pid=1348542)   File "/home/wentao/vllm-source/vllm/attention/layer.py", line 287, in __init__
(EngineCore_DP0 pid=1348542)     self.attn_backend = get_attn_backend(
(EngineCore_DP0 pid=1348542)                         ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1348542)   File "/home/wentao/vllm-source/vllm/attention/selector.py", line 90, in get_attn_backend
(EngineCore_DP0 pid=1348542)     return _cached_get_attn_backend(
(EngineCore_DP0 pid=1348542)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1348542)   File "/home/wentao/vllm-source/vllm/attention/selector.py", line 168, in _cached_get_attn_backend
(EngineCore_DP0 pid=1348542)     attention_cls = current_platform.get_attn_backend_cls(
(EngineCore_DP0 pid=1348542)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1348542)   File "/home/wentao/vllm-source/vllm/platforms/cuda.py", line 372, in get_attn_backend_cls
(EngineCore_DP0 pid=1348542)     raise ValueError(
(EngineCore_DP0 pid=1348542) ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN_MLA is not valid for this configuration. Reason: ['head_size not supported', 'non-MLA not supported']

This PR fixes that as well.
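
The fix follows roughly this idea (a hedged sketch with assumed backend and model names, not the exact test code): pair each backend under test with a model whose architecture it can actually serve, so an MLA backend such as FLASH_ATTN_MLA is never combined with a non-MLA model like Qwen3.

    # Hypothetical sketch -- backend and model names are assumptions for illustration.
    MLA_BACKENDS = {"FLASH_ATTN_MLA", "FLASHINFER_MLA", "CUTLASS_MLA"}

    def model_for_backend(backend: str) -> str:
        """Choose a test model compatible with the attention backend under test."""
        if backend in MLA_BACKENDS:
            # MLA backends require a DeepSeek-style (MLA) attention architecture.
            return "deepseek-ai/DeepSeek-V2-Lite"
        # Regular attention backends can use a standard dense model.
        return "Qwen/Qwen3-1.7B"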

Test

Now everything is green.

Signed-off-by: yewentao256 <[email protected]>
@mergify mergify bot added the v1 label Nov 18, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses a bug in the batch invariant MLA test by disabling problematic backends and dynamically selecting a compatible model for MLA tests. The changes are logical and align with the PR's goal. My review includes one suggestion to improve the maintainability of the test code by refactoring duplicated logic.
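
As a rough illustration of that maintainability point (pytest structure and names are assumptions, not the reviewer's actual suggestion), the duplicated backend-to-model pairing could be resolved in a single parametrized table rather than repeated per test:

    # Hypothetical sketch -- consolidates the duplicated selection logic in one place.
    import pytest

    CASES = [
        ("FLASH_ATTN_MLA", "deepseek-ai/DeepSeek-V2-Lite"),  # MLA backend -> MLA model
        ("TRITON_ATTN", "Qwen/Qwen3-1.7B"),                  # non-MLA backend -> dense model
    ]

    @pytest.mark.parametrize("backend,model", CASES, ids=[c[0] for c in CASES])
    def test_logprobs_batch_invariant(backend, model):
        ...  # run the batch-invariance check once per compatible pairing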


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Signed-off-by: yewentao256 <[email protected]>
@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 18, 2025