
Conversation

@orionr (Collaborator) commented Dec 10, 2025

Use the standard Docker image instead of the torch_nightly image for PyTorch nightlies testing and CI runs.

Moving this from #239 to a branch on upstream, following the testing process outlined at https://github.com/vllm-project/ci-infra?tab=readme-ov-file#how-to-test-changes-in-this-repo

Testing in progress:

  1. Baseline (my vllm fork matching HEAD, no ci-infra changes) at https://buildkite.com/vllm/ci/builds/42874/steps/canvas. Allowed 5 test runs to move forward. -> Seems like the PT nightlies build itself failed installing flashinfer, so all tests failed afterwards.
  2. Delta (my vllm fork matching HEAD, plus these ci-infra changes) at https://buildkite.com/vllm/ci/builds/42927/steps/canvas. A few tests to check. -> Command issue with the PT install.

After all this lands we can remove https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.nightly_torch, but that's not urgent.

cc @huydhn @atalman @yangw-dev @khluu

@orionr changed the title from "[PT nightlies] Remove nightly_torch Docker image and build" to "[WIP][PT nightlies] Remove nightly_torch Docker image and build" on Dec 10, 2025
@orionr (Collaborator, Author) commented Dec 10, 2025

@khluu I might need your help on this one and/or have you point me to an expert on Buildkite configs.

I'm trying to use the standard Docker builds here for PyTorch nightly testing, but I also need to run uv pip install torch torchvision torchaudio --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu128 (or add a Docker image layer) before each test runs. I thought I'd figured this out by adding an extra commands section, but it looks like that might need to be propagated down to and through render_cuda_config. Is that the right way to do this, or should I go a different path?
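
Concretely, each test step would need roughly this in front of its normal command (rough sketch, not what the pipeline does today; cu128 is just the index I'm targeting right now):

# Overwrite the release wheels baked into the standard CI image with the
# current PyTorch nightlies; the step's normal test command then runs unchanged.
uv pip install --pre torch torchvision torchaudio \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128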

Current status is that the main Docker image is used (which is good), but all tests are running on the release PyTorch version (not good), without the latest changes.

Latest failing run is at https://buildkite.com/vllm/ci/builds/42927/steps/canvas?sid=019b0a30-bee1-4b6b-8393-7f85b537d2ef with the error


[2025-12-10T22:21:19Z] public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:2dcbac9077ecadff0aa78b7c282f9e147a260e86
Error: Can't use both a step level command and the command parameter of the plugin

because of e596c0d#diff-b5c060fa4acd68fd48a2b3cdcd4069bd9eae5b0ee8512e1b25d8f8e2526834e5R480

Any thoughts? cc @atalman as well and I'll keep digging.
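
One idea I haven't validated (sketch only, assuming the docker plugin's command can be given as a single shell string): fold the nightly install into the existing plugin command instead of adding a separate step-level command, something like

# Hypothetical merged command; "<existing test command>" stands in for whatever the step runs today.
bash -c "uv pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu128 && <existing test command>"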

@huydhn (Collaborator) commented Dec 10, 2025

> I'm trying to use the standard Docker builds here for PyTorch nightly testing, but need to also run uv pip install torch torchvision torchaudio --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu128 (or create a Docker image layer) before each test runs. [...] Is that the right way to do this or should I go a different path?

I think the uv pip install torch --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu12x could only be done as a Docker layer inside https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile#L139-L143. Something like:

if [ "$NIGHTLY" = "1" ]; then
    uv pip install torch --pre --extra-index-url ${PYTORCH_CUDA_NIGHTLY_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
    python use_existing_torch.py
else
    uv pip install -r requirements/cuda.txt --extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
fi
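
For reference, the cu$(...) piece above just turns the CUDA version into the index directory name, assuming the nightly index uses the same cuNNN naming as the release index:

# e.g. with CUDA_VERSION=12.8.1
echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.'   # prints 128, giving .../cu128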

@orionr (Collaborator, Author) commented Dec 10, 2025

Good call on needing build as well as test signal. Let me see what I can do to modify the base Dockerfile.

@orionr force-pushed the orionr/pt-nightlies branch from 5424fa5 to 55368b3 on December 20, 2025 16:27
@orionr changed the title from "[WIP][PT nightlies] Remove nightly_torch Docker image and build" to "[PT nightlies] Remove nightly_torch Docker image and build" on Dec 20, 2025