
Conversation


@mgoin mgoin commented Nov 18, 2025

Purpose

Use a smaller model in the macOS smoke test to fix timeout issues.

Test Plan

Test Result

Confirmed manually that https://github.com/vllm-project/vllm/actions/runs/19479854049/job/55748976858 works fine.

The main issue is that distributed init seems to take about 15 minutes to finish:

(EngineCore_DP0 pid=5220) INFO 11-18 20:15:25 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.64.24:49303 backend=gloo
(EngineCore_DP0 pid=5220) INFO 11-18 20:31:10 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0

I think the likely bottleneck is torch.distributed.new_group() in GroupCoordinator.__init__(). For world_size=1, initialize_model_parallel() creates 5 GroupCoordinator instances (TP, DCP, PP, DP, EP), each of which creates 2 groups (device + CPU), for a total of 10 new_group() calls. Even for single-process groups, PyTorch may still perform slow initialization. The main optimization would be to skip or streamline group creation in the single-process case, but that's a larger change.
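As a rough illustration (an assumption-laden sketch, not vLLM's actual code; the axis names and group counts come from the description above), the pattern looks roughly like this, and timing it in isolation is one way to check whether new_group() is really the slow part:

```python
# Rough standalone repro sketch: world_size=1, 5 parallelism axes
# (TP, DCP, PP, DP, EP), and a device group plus a CPU group per axis,
# i.e. ~10 torch.distributed.new_group() calls in total.
import os
import time

import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

groups = []
start = time.perf_counter()
for axis in ("tp", "dcp", "pp", "dp", "ep"):  # the 5 GroupCoordinator instances
    for _ in range(2):  # device group + CPU group, mirroring GroupCoordinator.__init__()
        groups.append(dist.new_group(ranks=[0], backend="gloo"))
print(f"created {len(groups)} single-rank groups in {time.perf_counter() - start:.1f}s")

dist.destroy_process_group()
```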


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Michael Goin <[email protected]>
@gemini-code-assist

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

@mergify mergify bot added the ci/build label Nov 18, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

"model": "Qwen/Qwen3-0.6B",

P1: Update completion request to match served model

The smoke test now launches the server with trl-internal-testing/tiny-random-LlamaForCausalLM, but the completion request still posts "model": "Qwen/Qwen3-0.6B". With OpenAI-compatible APIs, a request for an unloaded model returns an error, so this curl check will consistently fail even though the server is running, breaking the workflow on every run. The request should target the model that was actually started.
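For illustration, a minimal sketch of a request that targets the served model (the real check uses curl, and the URL, port, and extra payload fields here are assumptions):

```python
# Hedged sketch of the corrected smoke-test request. The point is only that
# "model" must name the model the server was actually launched with.
import json
import urllib.request

payload = {
    "model": "trl-internal-testing/tiny-random-LlamaForCausalLM",  # the served model
    "prompt": "Hello",
    "max_tokens": 5,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",  # assumed host/port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```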


@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 18, 2025 17:41
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 18, 2025
@mgoin mgoin disabled auto-merge November 18, 2025 18:34
@mgoin mgoin enabled auto-merge (squash) November 18, 2025 20:49
@mgoin mgoin merged commit a4511e3 into main Nov 19, 2025
17 checks passed
@mgoin mgoin deleted the macos-smoke-test-fast branch November 19, 2025 06:46