[main][bugfix] Fix fullgraph padding bug in mtp eagle refactor #5692
base: main
Conversation
Code Review
This pull request addresses a bug in the padding logic for full graph mode, specifically in scenarios involving MTP (Multi-Token Prediction) and PCP (Prefill Context Parallelism). The modification correctly constrains the padding condition by ensuring that the number of input tokens does not exceed the maximum size of a captured graph. By using min(max_decode_tokens, self.cudagraph_batch_sizes[-1]), the change prevents the padding logic from being erroneously applied to batches that are too large for graph replay and would fall back to eager execution. This is a solid bug fix that enhances the robustness of the full graph execution path, particularly for cases with manually configured graph capture sizes.
max_decode_tokens = self.scheduler_config.max_num_seqs * self.uniform_decode_query_len
if self.compilation_config.cudagraph_mode.decode_mode() == CUDAGraphMode.FULL and \
-        uniform_decode and self.uniform_decode_query_len <= num_input_tokens <= max_decode_tokens:
+        uniform_decode and self.uniform_decode_query_len <= num_input_tokens <= min(max_decode_tokens, self.cudagraph_batch_sizes[-1]):
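For reference, a minimal standalone sketch of the clamped condition (the function name and explicit parameters are illustrative; in the runner these values come from self and the scheduler/compilation configs):

# A minimal, self-contained sketch of the padding decision after this change.
# Names mirror the diff above; the real check lives in the model runner.
def should_pad_for_full_graph(num_input_tokens: int,
                              uniform_decode: bool,
                              uniform_decode_query_len: int,
                              max_num_seqs: int,
                              cudagraph_batch_sizes: list[int],
                              full_graph_decode: bool) -> bool:
    max_decode_tokens = max_num_seqs * uniform_decode_query_len
    # Clamp the upper bound to the largest captured graph: batches above it
    # cannot be replayed and fall back to eager mode, so they must not be padded.
    upper_bound = min(max_decode_tokens, cudagraph_batch_sizes[-1])
    return (full_graph_decode and uniform_decode
            and uniform_decode_query_len <= num_input_tokens <= upper_bound)

# Example: capture sizes [16, 32, 64], max_num_seqs=128, MTP query length 2.
# A uniform-decode batch of 96 tokens is within the old bound (256) but above
# the largest captured graph (64), so it is no longer padded.
print(should_pad_for_full_graph(96, True, 2, 128, [16, 32, 64], True))   # False
print(should_pad_for_full_graph(48, True, 2, 128, [16, 32, 64], True))   # True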
I suggest
max_decode_tokens = min(self.scheduler_config.max_num_seqs * self.uniform_decode_query_len, self.cudagraph_batch_sizes[-1])
It's more reasonable.
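With the suggestion applied, the check would read roughly as follows (a sketch based on the diff above, not the exact committed code):

max_decode_tokens = min(self.scheduler_config.max_num_seqs * self.uniform_decode_query_len,
                        self.cudagraph_batch_sizes[-1])
if self.compilation_config.cudagraph_mode.decode_mode() == CUDAGraphMode.FULL and \
        uniform_decode and self.uniform_decode_query_len <= num_input_tokens <= max_decode_tokens:
    # pad num_input_tokens up to the nearest captured graph size
    ...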
done
Signed-off-by: lilinsiman <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
What this PR does / why we need it?
The condition that decides when to pad a batch for full graph mode with MTP and PCP has been tightened to handle the corner case where the graph capture sizes are specified manually: batches larger than the largest captured graph size are no longer padded.
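A hypothetical illustration of that corner case (the numbers are made up, not taken from the PR):

# Capture sizes specified manually, so the largest captured graph (32 tokens)
# is smaller than max_num_seqs * uniform_decode_query_len (64 * 2 = 128).
max_num_seqs = 64
uniform_decode_query_len = 2          # e.g. MTP decode with one speculative token
cudagraph_batch_sizes = [8, 16, 32]   # manually specified capture sizes

old_bound = max_num_seqs * uniform_decode_query_len          # 128
new_bound = min(old_bound, cudagraph_batch_sizes[-1])        # 32

# A uniform-decode batch of 48 tokens used to satisfy the old condition
# (48 <= 128) and was padded for full-graph replay even though no captured
# graph can hold it; with the clamp (48 > 32) it now runs eagerly, unpadded.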
Does this PR introduce any user-facing change?
no
How was this patch tested?
Unit tests and existing tests.