@destinysky destinysky commented Jan 7, 2026

What this PR does / why we need it?
This PR fixes a bug in NetLoader (PR#2888). The bug was caused by PR#3612 ([1/N][Refactor] Refactor code to adapt with vllm main), which removed the `stateless_init_device_torch_dist_pg` function from platform.py, so the call to that function fails. This PR adds a self-contained way to create a stateless process group that does not depend on external code.

Does this PR introduce any user-facing change?
No

How was this patch tested?
Same as PR#2888.

Signed-off-by: destinysky <[email protected]>

github-actions bot commented Jan 7, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a bug in NetLoader caused by a previous refactoring that removed necessary process group initialization logic. The fix introduces a new file, `vllm_ascend/model_loader/netloader/executor/netloader_pg.py`, which contains self-contained functions (`stateless_init_process_group` and `destroy_stateless_process_group`) for managing stateless HCCL process groups on NPU devices. The changes in `elastic_load.py` simply adopt these new utility functions. My review focuses on the new implementation in `netloader_pg.py`, where I've found a couple of areas for improvement regarding code correctness and robustness.
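
As a rough illustration of the init/destroy pairing described above, here is a toy pure-Python sketch. It does not use HCCL, torch, or NPU devices, and it is not the code from `netloader_pg.py`. The two function names are taken from the review text, but their signatures, the `ToyProcessGroup` class, and its fields are assumptions made up for this example. The point it tries to show is only that a "stateless" group is returned to the caller as a handle instead of being registered in module-level state, and that the caller is responsible for destroying it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProcessGroup:
    """Hypothetical stand-in for a communication group handle (no HCCL involved)."""
    rank: int
    world_size: int
    store: dict = field(default_factory=dict)  # per-group scratch state
    destroyed: bool = False


def stateless_init_process_group(rank: int, world_size: int) -> ToyProcessGroup:
    """Create a group handle without touching any module-level (global) state.

    Assumed signature: the real function very likely takes more arguments.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return ToyProcessGroup(rank=rank, world_size=world_size)


def destroy_stateless_process_group(pg: ToyProcessGroup) -> None:
    """Release the handle's resources; calling it twice is a no-op."""
    if pg.destroyed:
        return
    pg.store.clear()
    pg.destroyed = True


if __name__ == "__main__":
    pg = stateless_init_process_group(rank=0, world_size=2)
    try:
        pg.store["weights_received"] = True  # placeholder for real work
    finally:
        # Pair every init with a destroy so the group never outlives its use.
        destroy_stateless_process_group(pg)
    print(pg.destroyed)
```

Because nothing is stored globally, two groups created this way are independent of each other, which is the property that lets such helpers live in a single file without depending on the surrounding platform code.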

@destinysky destinysky closed this Jan 8, 2026
@destinysky destinysky reopened this Jan 8, 2026