[V1][P/D]Bug fix: handle edge case where KVConnectorOutput is None #22473


Closed

Conversation

Collaborator

@liuzijing2014 commented Aug 7, 2025

Purpose

In #21980 we refactored the KV connector path, but missed an edge case: when the KV connector loads the KV cache asynchronously, some model iterations produce no KV connector output at all.

    if (not kv_connector_output.finished_sending
            and not kv_connector_output.finished_recving):
        return EMPTY_MODEL_RUNNER_OUTPUT

vllm/v1/outputs.py, lines 120 to 127 (at 7e3a8dc):

    EMPTY_MODEL_RUNNER_OUTPUT = ModelRunnerOutput(req_ids=[],
                                                  req_id_to_index={},
                                                  sampled_token_ids=[],
                                                  spec_token_ids=None,
                                                  logprobs=None,
                                                  prompt_logprobs_dict={},
                                                  pooler_output=[],
                                                  num_nans_in_logits=None)
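
For reference, this constant leaves kv_connector_output unset (None), which is exactly what the aggregator later dereferences. A minimal, self-contained sketch of the failure mode, using simplified stand-in dataclasses rather than the real vLLM types:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class KVConnectorOutput:
        finished_sending: Optional[set] = None
        finished_recving: Optional[set] = None

    @dataclass
    class ModelRunnerOutput:
        kv_connector_output: Optional[KVConnectorOutput] = None

    def aggregate(outputs):
        # Mirrors the loop in kv_connector/utils.py that assumed a
        # non-None kv_connector_output on every worker output.
        for model_output in outputs:
            kvc = model_output.kv_connector_output
            kvc.finished_sending  # AttributeError when kvc is None

    aggregate([ModelRunnerOutput()])  # 'NoneType' object has no attribute 'finished_sending'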

Test

Before

(EngineCore_0 pid=2913251) Traceback (most recent call last):
(EngineCore_0 pid=2913251)   File "/usr/local/fbcode/platform010/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=2913251)     self.run()
(EngineCore_0 pid=2913251)   File "/usr/local/fbcode/platform010/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=2913251)     self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=2913251)   File "/data/users/zijingliu/fbsource/buck-out/v2/gen/fbcode/603862b2f935f1b2/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/v1/engine/core.py", line 687, in run_engine_core
(EngineCore_0 pid=2913251)     raise e
(EngineCore_0 pid=2913251)   File "/data/users/zijingliu/fbsource/buck-out/v2/gen/fbcode/603862b2f935f1b2/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/v1/engine/core.py", line 676, in run_engine_core
(EngineCore_0 pid=2913251)     engine_core.run_busy_loop()
(EngineCore_0 pid=2913251)   File "/data/users/zijingliu/fbsource/buck-out/v2/gen/fbcode/603862b2f935f1b2/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/v1/engine/core.py", line 703, in run_busy_loop
(EngineCore_0 pid=2913251)     self._process_engine_step()
(EngineCore_0 pid=2913251)   File "/data/users/zijingliu/fbsource/buck-out/v2/gen/fbcode/603862b2f935f1b2/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/v1/engine/core.py", line 728, in _process_engine_step
(EngineCore_0 pid=2913251)     outputs, model_executed = self.step_fn()
(EngineCore_0 pid=2913251)   File "/data/users/zijingliu/fbsource/buck-out/v2/gen/fbcode/603862b2f935f1b2/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/v1/engine/core.py", line 273, in step
(EngineCore_0 pid=2913251)     model_output = self.execute_model_with_error_logging(
(EngineCore_0 pid=2913251)   File "/data/users/zijingliu/fbsource/buck-out/v2/gen/fbcode/603862b2f935f1b2/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/v1/engine/core.py", line 259, in execute_model_with_error_logging
(EngineCore_0 pid=2913251)     raise err
(EngineCore_0 pid=2913251)   File "/data/users/zijingliu/fbsource/buck-out/v2/gen/fbcode/603862b2f935f1b2/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/v1/engine/core.py", line 250, in execute_model_with_error_logging
(EngineCore_0 pid=2913251)     return model_fn(scheduler_output)
(EngineCore_0 pid=2913251)   File "/data/users/zijingliu/fbsource/buck-out/v2/gen/fbcode/603862b2f935f1b2/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/v1/executor/multiproc_executor.py", line 192, in execute_model
(EngineCore_0 pid=2913251)     return self.kv_output_aggregator.aggregate(outputs, self.output_rank)
(EngineCore_0 pid=2913251)   File "/data/users/zijingliu/fbsource/buck-out/v2/gen/fbcode/603862b2f935f1b2/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/distributed/kv_transfer/kv_connector/utils.py", line 146, in aggregate
(EngineCore_0 pid=2913251)     update_finished_set(output.finished_sending,
(EngineCore_0 pid=2913251) AttributeError: 'NoneType' object has no attribute 'finished_sending'

After
P/D disaggregation now runs properly.

Eval

[2025-08-07 14:50:41,290] [rank 0] [INFO] Per prompt detailed info dumped to /tmp/eval_dump.gsm8k.8_shot.1_gen.20250807_145041.json
[2025-08-07 14:50:41,290] [rank 0] [INFO] Evaluation results on task gsm8k.8_shot.1_gen: em: 0.948000 | f1: 0.948000 | em_maj1@1: 0.948000 | f1_maj1@1: 0.948000
[2025-08-07 14:50:41,290] [rank 0] [INFO] Task gsm8k.8_shot.1_gen took 189.51 seconds

cc @sdavidbd @njhill @houseroad


github-actions bot commented Aug 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; they only run the fastcheck CI, which executes a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@liuzijing2014 marked this pull request as ready for review August 7, 2025 21:55
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request fixes a critical bug that occurred when aggregating KV connector outputs, where KVConnectorOutput could be None, leading to an AttributeError. The fix correctly adds a None check before accessing attributes of kv_connector_output, making the aggregation logic more robust. The change is correct and effectively resolves the issue. I've added one comment suggesting the addition of a unit test to cover this new edge case and prevent future regressions.

Comment on lines 145 to 150
    kv_connector_output = output.kv_connector_output
    if kv_connector_output:
        update_finished_set(kv_connector_output.finished_sending,
                            self._send_remaining_count, finished_sending)
        update_finished_set(kv_connector_output.finished_recving,
                            self._recv_remaining_count, finished_recving)

Severity: high

This change correctly fixes a critical bug where kv_connector_output could be None, causing a crash. However, to prevent future regressions, it's important to add a unit test that covers this specific edge case.

The existing tests in tests/v1/kv_connector/unit/test_output_aggreagator.py do not seem to cover the scenario where kv_connector_output is None. Please consider adding a new test case that asserts the correct behavior when one or more of the ModelRunnerOutput objects in the outputs list has kv_connector_output=None.
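
A rough sketch of such a regression test; the import paths, the KVOutputAggregator constructor signature, and the copy-based helper are assumptions for illustration, not the actual test file's API:

    import copy

    from vllm.distributed.kv_transfer.kv_connector.utils import KVOutputAggregator
    from vllm.v1.outputs import EMPTY_MODEL_RUNNER_OUTPUT, KVConnectorOutput

    def _output_with(kv_connector_output):
        # Clone the empty template and attach (or omit) a connector output.
        out = copy.copy(EMPTY_MODEL_RUNNER_OUTPUT)
        out.kv_connector_output = kv_connector_output
        return out

    def test_aggregate_with_missing_kv_connector_output():
        aggregator = KVOutputAggregator(world_size=2)  # assumed constructor
        outputs = [
            _output_with(KVConnectorOutput()),  # worker 0: empty but present
            _output_with(None),                 # worker 1: no connector output at all
        ]
        # Before the fix, aggregate() raised AttributeError on the None entry.
        aggregated = aggregator.aggregate(outputs, output_rank=0)
        assert aggregated is not None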

Contributor

sdavidbd commented Aug 7, 2025

Thanks for the catch, @liuzijing2014! My original intent was that the KV connector path would always produce a KVConnectorOutput, so this case wasn’t expected.

Instead of allowing None, would it make sense to define a default EMPTY_MODEL_RUNNER_WITH_KVC_OUTPUT for such scenarios?

Also, note that a similar issue exists in gpu_worker.py for PP:
https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_worker.py#L380
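
For concreteness, a minimal sketch of what such a constant could look like, mirroring the EMPTY_MODEL_RUNNER_OUTPUT definition quoted above (the kv_connector_output field and KVConnectorOutput's no-arg constructor are assumptions):

    EMPTY_MODEL_RUNNER_WITH_KVC_OUTPUT = ModelRunnerOutput(
        req_ids=[],
        req_id_to_index={},
        sampled_token_ids=[],
        spec_token_ids=None,
        logprobs=None,
        prompt_logprobs_dict={},
        pooler_output=[],
        num_nans_in_logits=None,
        kv_connector_output=KVConnectorOutput(),  # empty but non-None
    )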

@liuzijing2014
Collaborator Author

@sdavidbd Makes sense to me, let me create a new empty default then.

Signed-off-by: Zijing Liu <[email protected]>
Member

njhill commented Aug 8, 2025

Maybe it would be simpler / less fragile to just add a check in the aggregator here:

    output = output.kv_connector_output
    if not output:
        continue
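
In context, the check would sit at the top of the aggregation loop in kv_connector/utils.py; a sketch, with the surrounding loop paraphrased from the traceback and the review comment above rather than copied from the file:

    for model_output in outputs:
        output = model_output.kv_connector_output
        if not output:
            continue  # this worker produced no KV connector output this step
        update_finished_set(output.finished_sending,
                            self._send_remaining_count, finished_sending)
        update_finished_set(output.finished_recving,
                            self._recv_remaining_count, finished_recving)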

Contributor

sdavidbd commented Aug 9, 2025

> Maybe it would be simpler / less fragile to just add a check in the aggregator here:
>
>     output = output.kv_connector_output
>     if not output:
>         continue

Good point, @njhill - adding a check in the aggregator would certainly be simpler and less fragile in the short term.
My thinking was that having an EMPTY_MODEL_RUNNER_WITH_KVC_OUTPUT would keep the invariant that the KV connector path always returns a KVConnectorOutput, which might make the downstream logic cleaner and avoid scattered None checks.

Either way works — it’s just a trade-off between a quick fix and enforcing that invariant in the data model.


CaveNightingale commented Aug 11, 2025

What's the point keeping EMPTY_MODEL_RUNNER_OUTPUT?
It seems the only usages of this variable are {TPUModelRunner, GPUModelRunner}::execute_model and I think they can also be replaced with the new EMPTY_MODEL_RUNNER_WITH_KVC_OUTPUT.
TPUModelRunner::execute_model GPUModelRunner::execute_model

@liuzijing2014
Collaborator Author

How about replacing EMPTY_MODEL_RUNNER_OUTPUT with EMPTY_MODEL_RUNNER_WITH_KVC_OUTPUT and making this the new convention for the empty default?

Member

njhill commented Aug 11, 2025

We have another PR here with the simpler fix, among other changes.

@sdavidbd I understand the invariant idea but it also means we're often allocating an empty KVConnectorOutput unnecessarily. And I guess it doesn't harm to be defensive with the checks.

@sdavidbd
Contributor

Thanks, @njhill. I see where you're coming from, but I'm a bit hesitant about adding those checks; my concern is they might end up masking logic bugs rather than helping us catch them early.

My preference would be to keep the stronger invariant in both directions: on the KV-connector path we always yield a KVConnectorOutput, and off that path we never do (connector is not None if and only if kv_connector_output is not None).

On the allocation concern — with the new KV-connector context manager added in #21980 we already create an empty KVConnectorOutput, but I believe the cost is negligible in both CPU and memory terms since it’s just a small dataclass with Optional fields defaulting to None.

The change in this PR targets the no‑forward path (kv_connector_no_forward). There, we avoid creating a full ModelRunnerOutput and only copy EMPTY_MODEL_RUNNER_OUTPUT if the KVConnectorOutput is non‑empty. If it’s empty, we return EMPTY_MODEL_RUNNER_WITH_KVC_OUTPUT — a constant “empty” ModelRunnerOutput that still carries an empty KVConnectorOutput.

I could be missing something, but I think keeping the invariant will make it easier to spot real issues and avoid unnecessary None checks scattered through the code.
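
A sketch of that no-forward path as described (control flow inferred from this comment; the is_empty() helper and the exact call site are assumptions):

    import copy

    def handle_kv_connector_no_forward(kv_connector_output):
        if kv_connector_output.is_empty():  # assumed helper: nothing finished
            # Keep the invariant without allocating a new ModelRunnerOutput.
            return EMPTY_MODEL_RUNNER_WITH_KVC_OUTPUT
        # Something finished: copy the empty template and attach the output.
        output = copy.copy(EMPTY_MODEL_RUNNER_OUTPUT)
        output.kv_connector_output = kv_connector_output
        return output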

@robertgshaw2-redhat
Collaborator

Solved by #22663.
