
Conversation

@yiz-liu yiz-liu (Collaborator) commented Jan 28, 2026

What this PR does / why we need it?

Adds a larger reserve (+2) for the query_start_loc buffer in FULL cudagraph mode and introduces a helper to pad it so that the first dimension of hidden_states equals the final element of actual_seq_lengths_q, as required by the FIA/TND operator.

The helper handles both uniform and mixed batches (inserting a dummy request for mixed batches), consolidates the previous ad-hoc padding into a single place, copies the updated buffer to the device, and asserts the layout constraint before building the attention metadata. Together, these changes prevent kernel mismatches or failures and ensure correct shapes for FIA/TND execution in full graph modes.

The helper currently lives in execute_model. The original design placed it in _prepare_inputs, but that doesn't work because it must run after padding. While I'd prefer to minimize the impact and reuse as much of the base class as possible in the future, that doesn't seem achievable at the moment.
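For context, the layout constraint amounts to: the padded token count (the first dimension of hidden_states) must equal the last entry of the padded query_start_loc prefix sums, from which actual_seq_lengths_q is derived. A minimal sketch of that check, using purely illustrative values rather than the actual vllm-ascend code:

```python
import numpy as np

# Hypothetical values: one real request of 3 tokens, padded up to 4 tokens.
num_tokens_padded = 4                                   # hidden_states.shape[0] after padding
query_start_loc = np.array([0, 3, 4], dtype=np.int32)   # prefix sums incl. appended dummy request

# actual_seq_lengths_q is derived from query_start_loc, so its final element
# must match the padded token count before the attention metadata is built.
assert query_start_loc[-1] == num_tokens_padded
```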

Does this PR introduce any user-facing change?

None.

How was this patch tested?

Test cases added.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a fix to satisfy a layout constraint for the FIA/TND operator in full cudagraph mode by padding the query_start_loc buffer. The changes include increasing the buffer size, centralizing the padding logic into a new helper function _pad_query_start_loc_for_fia, and adding an assertion to ensure the constraint is met. This is a good improvement for correctness and maintainability.

I've found one critical issue in the new helper function related to a slice mismatch that would cause a runtime error. Please see my specific comment for details and a suggested fix.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@yiz-liu yiz-liu force-pushed the fix-full branch 2 times, most recently from bf4cfa6 to 9b1c177 on January 29, 2026 02:43
@yiz-liu yiz-liu marked this pull request as ready for review January 29, 2026 03:26
@yiz-liu yiz-liu added the ready (read for review) and ready-for-test (start test by label for PR) labels Jan 29, 2026
@wangxiyuan wangxiyuan (Collaborator) left a comment

Enable the e2e test the same as #6284?

@yiz-liu yiz-liu (Collaborator, Author) left a comment

  1. Enable other skipped test cases
  2. Check other persistent buffer
  3. Check speculative decoding (workaround + reuse)

Comment on lines +527 to +536
if num_tokens_padded == num_reqs_padded * self.uniform_decode_query_len:
    # Uniform-batch case: num_reqs must be no greater than num_reqs_padded
    assert num_reqs <= num_reqs_padded

    last_loc = self.query_start_loc.np[num_reqs]
    self.query_start_loc.np[num_reqs + 1 : num_reqs_padded + 1] = (
        self.arange_np[1 : num_reqs_padded + 1 - num_reqs]
        * self.uniform_decode_query_len
        + last_loc
    )
@yiz-liu (Collaborator, Author) commented:

[0, 1, 2] -> [0, 1, 2, 3, 4]
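A rough standalone NumPy illustration of this uniform-batch branch (values chosen only to reproduce the example above; names mimic the snippet but this is not the real model-runner state):

```python
import numpy as np

num_reqs, num_reqs_padded = 2, 4              # two real decode requests, padded to four
uniform_decode_query_len = 1
arange_np = np.arange(16, dtype=np.int32)

query_start_loc = np.zeros(num_reqs_padded + 2, dtype=np.int32)
query_start_loc[: num_reqs + 1] = [0, 1, 2]   # prefix sums for the real requests

# Extend the prefix sums to cover the padded (dummy) decode requests.
last_loc = query_start_loc[num_reqs]
query_start_loc[num_reqs + 1 : num_reqs_padded + 1] = (
    arange_np[1 : num_reqs_padded + 1 - num_reqs] * uniform_decode_query_len + last_loc
)
print(query_start_loc[: num_reqs_padded + 1])  # [0 1 2 3 4]
```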

Comment on lines +537 to +545
else:
    # Mixed-batch case: num_reqs must equal num_reqs_padded
    assert num_reqs == num_reqs_padded

    # Insert a dummy request instead of setting query_start_loc[num_reqs] = num_tokens_padded directly
    self.query_start_loc.np[num_reqs_padded + 1] = num_tokens_padded
    num_reqs_padded = num_reqs_padded + 1

self.query_start_loc.copy_to_gpu()
@yiz-liu (Collaborator, Author) commented:

[0, 3] -> [0, 3, 4]
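A matching standalone illustration of the mixed-batch branch, where a dummy trailing request absorbs the padded tokens (again, illustrative values only, reproducing the [0, 3] -> [0, 3, 4] example):

```python
import numpy as np

num_reqs = num_reqs_padded = 1            # one real request of 3 tokens
num_tokens_padded = 4                     # padded up to 4 tokens

query_start_loc = np.zeros(num_reqs + 2, dtype=np.int32)
query_start_loc[: num_reqs + 1] = [0, 3]

# Append a dummy request covering the padded token; this is why the buffer
# is allocated with max_num_reqs + 2 in FULL graph mode.
query_start_loc[num_reqs_padded + 1] = num_tokens_padded
num_reqs_padded += 1
print(query_start_loc[: num_reqs_padded + 1])  # [0 3 4]
```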

Comment on lines +210 to +214
# NOTE: For FULL mode we change +1 to +2 to reserve extra space for padding.
# See _pad_query_start_loc_for_fia.
self.query_start_loc = self._make_buffer(
    self.max_num_reqs + 2, dtype=torch.int32  # type: ignore[has-type]
)
@yiz-liu (Collaborator, Author) commented:

Check if other buffers should be extended too.

@yiz-liu (Collaborator, Author) commented:

This is strange; no error so far.

Comment on lines +1244 to +1248
if cudagraph_mode != CUDAGraphMode.NONE:
    num_reqs_padded = self._pad_query_start_loc_for_fia(
        num_tokens_padded, num_reqs_padded, num_reqs
    )

@yiz-liu (Collaborator, Author) commented:

Maybe current_platform.post_process_after_padding

Adds a larger reserve (+2) for the query_start_loc buffer in FULL cudagraph mode and introduces a helper to pad it so the first dimension of hidden_states equals the final element of actual_seq_lengths_q required by the FIA/TND operator.

Handles both uniform and mixed batches (inserting a dummy request for mixed batches), moves ad-hoc padding into a single helper, copies the updated buffer to the device, and asserts the layout constraint before building attention metadata. These changes prevent kernel mismatches/failures and ensure correct shapes for FIA/TND execution in full graph modes.

Signed-off-by: Yizhou Liu <[email protected]>
@yiz-liu yiz-liu merged commit 56f5d3b into vllm-project:main Jan 30, 2026
26 checks passed
@yiz-liu yiz-liu deleted the fix-full branch January 30, 2026 08:41
Potabk added a commit to Potabk/vllm-ascend that referenced this pull request Jan 31, 2026
wangxiyuan pushed a commit that referenced this pull request Jan 31, 2026
…onstraint (#6459)

This reverts commit 56f5d3b.

### What this PR does / why we need it?
Patch #6357 breaks functionality in the spec_decode scenario; let's revert it and make CI happy first.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main: vllm-project/vllm@dc917cc

Signed-off-by: wangli <[email protected]>