[Fix] Pads query_start_loc to satisfy FIA/TND constraint #6357
Conversation
Code Review
This pull request introduces a fix to satisfy a layout constraint for the FIA/TND operator in full cudagraph mode by padding the query_start_loc buffer. The changes include increasing the buffer size, centralizing the padding logic into a new helper function _pad_query_start_loc_for_fia, and adding an assertion to ensure the constraint is met. This is a good improvement for correctness and maintainability.
I've found one critical issue in the new helper function related to a slice mismatch that would cause a runtime error. Please see my specific comment for details and a suggested fix.
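For context, the layout constraint the review refers to can be stated in one line: the last entry of the cumulative query-length array handed to the FIA/TND kernel must equal the token dimension of hidden_states. A minimal sketch of that relationship (illustrative tensors and sizes only, not the PR's actual variables):

```python
import torch

# Illustrative shapes: 3 requests whose padded tokens sum to 8.
num_tokens_padded = 8
hidden_states = torch.zeros(num_tokens_padded, 128)              # [num_tokens_padded, hidden_size]
query_start_loc = torch.tensor([0, 3, 6, 8], dtype=torch.int32)  # cumulative query lengths
actual_seq_lengths_q = query_start_loc[1:]                       # what FIA/TND consumes

# The constraint this PR enforces by padding query_start_loc:
assert actual_seq_lengths_q[-1].item() == hidden_states.shape[0]
```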
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
bf4cfa6 to 9b1c177
wangxiyuan left a comment
Enable the e2e test, the same as #6284?
yiz-liu left a comment
- Enable other skipped test cases
- Check other persistent buffer
- Check speculative decoding (workaround + reuse)
```python
if num_tokens_padded == num_reqs_padded * self.uniform_decode_query_len:
    # Uniform-batch case: num_reqs must be no greater than num_reqs_padded
    assert num_reqs <= num_reqs_padded

    last_loc = self.query_start_loc.np[num_reqs]
    self.query_start_loc.np[num_reqs + 1 : num_reqs_padded + 1] = (
        self.arange_np[1 : num_reqs_padded + 1 - num_reqs]
        * self.uniform_decode_query_len
        + last_loc
    )
```
[0, 1, 2] -> [0, 1, 2, 3, 4]
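The example above can be reproduced standalone with NumPy; the values below (num_reqs=2, num_reqs_padded=4, uniform_decode_query_len=1) are assumed to match the reviewer's [0, 1, 2] -> [0, 1, 2, 3, 4] illustration:

```python
import numpy as np

num_reqs, num_reqs_padded = 2, 4
uniform_decode_query_len = 1

query_start_loc = np.zeros(num_reqs_padded + 1, dtype=np.int32)
query_start_loc[: num_reqs + 1] = [0, 1, 2]
arange_np = np.arange(num_reqs_padded + 1)

# Same arithmetic as the snippet above: extend with dummy single-token decodes.
last_loc = query_start_loc[num_reqs]
query_start_loc[num_reqs + 1 : num_reqs_padded + 1] = (
    arange_np[1 : num_reqs_padded + 1 - num_reqs] * uniform_decode_query_len + last_loc
)
print(query_start_loc)  # [0 1 2 3 4]
```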
```python
else:
    # Mixed-batch case: num_reqs must equal num_reqs_padded
    assert num_reqs == num_reqs_padded

    # Insert a dummy request instead of setting query_start_loc[num_reqs] = num_tokens_padded directly
    self.query_start_loc.np[num_reqs_padded + 1] = num_tokens_padded
    num_reqs_padded = num_reqs_padded + 1

self.query_start_loc.copy_to_gpu()
```
[0, 3] -> [0, 3, 4]
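Similarly, the mixed-batch branch can be checked in isolation; the values below (num_reqs = num_reqs_padded = 1, num_tokens_padded = 4) are assumed to match the [0, 3] -> [0, 3, 4] illustration:

```python
import numpy as np

num_reqs = num_reqs_padded = 1
num_tokens_padded = 4

# The buffer holds max_num_reqs + 2 entries, so index num_reqs_padded + 1 is valid.
query_start_loc = np.zeros(num_reqs_padded + 2, dtype=np.int32)
query_start_loc[: num_reqs + 1] = [0, 3]

# Append one dummy request that absorbs the padded token, as in the snippet above.
query_start_loc[num_reqs_padded + 1] = num_tokens_padded
num_reqs_padded += 1
print(query_start_loc[: num_reqs_padded + 1])  # [0 3 4]
```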
```python
# NOTE: For FULL mode we change +1 to +2 to reserve extra space for padding.
# See _pad_query_start_loc_for_fia.
self.query_start_loc = self._make_buffer(
    self.max_num_reqs + 2, dtype=torch.int32  # type: ignore[has-type]
)
```
Check if other buffers should be extended too.
This is strange; no error has shown up so far.
```python
if cudagraph_mode != CUDAGraphMode.NONE:
    num_reqs_padded = self._pad_query_start_loc_for_fia(
        num_tokens_padded, num_reqs_padded, num_reqs
    )
```
Maybe current_platform.post_process_after_padding
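As a rough illustration of that suggestion (the method name post_process_after_padding is only proposed here; the signature below is hypothetical and does not exist in vLLM today), the padding could be routed through a platform hook so the base runner stays platform-agnostic:

```python
# Hypothetical sketch only -- neither this hook nor its signature exists yet.
class NPUPlatform:
    def post_process_after_padding(
        self, runner, num_tokens_padded: int, num_reqs_padded: int, num_reqs: int
    ) -> int:
        # Delegate to the Ascend-specific helper introduced in this PR.
        return runner._pad_query_start_loc_for_fia(
            num_tokens_padded, num_reqs_padded, num_reqs
        )
```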
Adds a larger reserve (+2) for the query_start_loc buffer in FULL cudagraph mode and introduces a helper to pad it so the first dimension of hidden_states equals the final element of actual_seq_lengths_q required by the FIA/TND operator. Handles both uniform and mixed batches (inserting a dummy request for mixed batches), moves ad-hoc padding into a single helper, copies the updated buffer to the device, and asserts the layout constraint before building attention metadata. These changes prevent kernel mismatches/failures and ensure correct shapes for FIA/TND execution in full graph modes.
Signed-off-by: Yizhou Liu <[email protected]>
…m-project#6357)" This reverts commit 56f5d3b.
Signed-off-by: wangli <[email protected]>
…onstraint (#6459)
This reverts commit 56f5d3b.
What this PR does / why we need it?
Patch #6357 breaks functionality in the spec_decode scenario, so let's revert it first and make CI happy.
Does this PR introduce any user-facing change?
How was this patch tested?
- vLLM version: v0.14.1
- vLLM main: vllm-project/vllm@dc917cc
Signed-off-by: wangli <[email protected]>
What this PR does / why we need it?
This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, copies the updated buffer to the device, and asserts the layout constraint before building the attention metadata. Together, these changes prevent kernel mismatches or failures and ensure correct shapes for FIA/TND execution in full graph modes.
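Putting the fragments from the review together, the helper roughly takes the following shape; this is a sketch reconstructed from the diff snippets above, not necessarily the verbatim implementation:

```python
def _pad_query_start_loc_for_fia(self, num_tokens_padded, num_reqs_padded, num_reqs):
    """Pad query_start_loc so its last entry equals num_tokens_padded (FIA/TND)."""
    if num_tokens_padded == num_reqs_padded * self.uniform_decode_query_len:
        # Uniform-batch case: extend with dummy fixed-length decode requests.
        assert num_reqs <= num_reqs_padded
        last_loc = self.query_start_loc.np[num_reqs]
        self.query_start_loc.np[num_reqs + 1 : num_reqs_padded + 1] = (
            self.arange_np[1 : num_reqs_padded + 1 - num_reqs]
            * self.uniform_decode_query_len
            + last_loc
        )
    else:
        # Mixed-batch case: append one dummy request covering the padded tokens.
        assert num_reqs == num_reqs_padded
        self.query_start_loc.np[num_reqs_padded + 1] = num_tokens_padded
        num_reqs_padded = num_reqs_padded + 1
    self.query_start_loc.copy_to_gpu()
    return num_reqs_padded
```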
We currently place this helper in execute_model. My original design was to include it in _prepare_inputs, but that doesn't work because it must run after padding. While I'd prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn't seem achievable at the moment.
Does this PR introduce any user-facing change?
None.
How was this patch tested?
Test cases added.