
Commit d109134

LiuYi-Up authored and zjks98 committed
[Bugfix] bugfix for the order of dummy run pad and sync (vllm-project#5777)
### What this PR does / why we need it?

This PR addresses an issue in piecewise graph mode when Multi-Token Prediction (MTP) is enabled. Specifically, the original dummy run performs the following steps in order:

1. Sync DP (input length = 1 + k)
2. Dispatch (input length = 1 + k, with padding == graph size)

However, the model execution phase uses a different sequence:

1. Padding (input length = 1, with padding)
2. Sync DP (input length = 1 + k)
3. Dispatch (input length 1 + k != graph size 1 + k, with padding)

This discrepancy means the input sizes used during model execution no longer match those expected by the dispatch graph, producing an inconsistent graph size. This PR reorders the operations during model execution to match the dummy run sequence, so the dispatch graph size stays aligned and the mismatch is resolved.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654

Signed-off-by: LiuYi-UP <[email protected]>
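For illustration only, here is a minimal, self-contained toy sketch of the corrected ordering described above. The `GRAPH_SIZES`, `dispatch`, `sync_dp`, and `run_step` names are made up for this example and are not the actual vLLM Ascend implementation:

```python
# Toy illustration of the ordering fix (hypothetical helpers, not vLLM code):
# "dispatch" rounds a token count up to the nearest captured graph size, and
# "sync_dp" pads every data-parallel rank up to the largest count in the group.

GRAPH_SIZES = [4, 8, 16, 32]  # assumed captured cudagraph batch sizes


def dispatch(num_tokens: int) -> int:
    """Pick the smallest captured graph size that fits num_tokens."""
    return min(s for s in GRAPH_SIZES if s >= num_tokens)


def sync_dp(num_tokens: int, other_ranks: list[int]) -> int:
    """DP sync: every rank pads to the largest token count in the group."""
    return max([num_tokens] + other_ranks)


def run_step(local_tokens: int, other_ranks: list[int]) -> int:
    """Fixed ordering: dispatch first, sync DP on the padded size, and
    re-dispatch only if the sync raised the token count."""
    padded = dispatch(local_tokens)        # graph size for the local batch
    synced = sync_dp(padded, other_ranks)  # agree on a size across DP ranks
    if synced != padded:                   # another rank needed more padding
        padded = dispatch(synced)          # re-dispatch to a matching graph
    return padded


if __name__ == "__main__":
    # Local rank runs 1 + k = 5 tokens with MTP; a peer rank already padded to 16.
    print(run_step(5, other_ranks=[16]))  # -> 16: dispatch and DP sync agree
```

The point of the guard is that a second dispatch happens only when the DP sync enlarges the local token count, so the selected graph size and the synced padding always agree.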
1 parent a23dfce commit d109134

File tree

1 file changed: +11 -5 lines changed


vllm_ascend/worker/model_runner_v1.py

Lines changed: 11 additions & 5 deletions
@@ -2072,10 +2072,14 @@ def _dummy_run(
         if self.is_kv_producer and not self.is_kv_consumer:
             with_prefill = True

+        has_lora = True if self.lora_config and self.compilation_config.cudagraph_specialize_lora else False
+        _ag_mode, batch_descriptor = \
+            self.cudagraph_dispatcher.dispatch(num_tokens=num_tokens, uniform_decode=uniform_decode, has_lora=has_lora)
+
         # Padding for DP
         (num_tokens, num_tokens_across_dp,
-         with_prefill) = self._sync_metadata_across_dp(num_tokens,
-                                                        with_prefill)
+         with_prefill) = self._sync_metadata_across_dp(
+             batch_descriptor.num_tokens, with_prefill)

         # If cudagraph_mode.decode_mode() == FULL and
         # cudagraph_mode.seperate_routine(). This means that we are using
@@ -2122,9 +2126,11 @@ def _dummy_run(
         if not is_profile and self.dynamic_eplb:
             self.eplb_updator.forward_before()

-        has_lora = True if self.lora_config and self.compilation_config.cudagraph_specialize_lora else False
-        _ag_mode, batch_descriptor = \
-            self.cudagraph_dispatcher.dispatch(num_tokens=num_tokens, uniform_decode=uniform_decode, has_lora=has_lora)
+        if num_tokens != batch_descriptor.num_tokens:
+            _ag_mode, batch_descriptor = self.cudagraph_dispatcher.dispatch(
+                num_tokens=num_tokens,
+                uniform_decode=uniform_decode,
+                has_lora=has_lora)

         num_tokens_padded = batch_descriptor.num_tokens
         num_reqs_padded = (batch_descriptor.num_reqs if
