
Conversation

@orionr (Collaborator) commented Dec 10, 2025

Use the standard Docker image instead of the torch_nightly image for PyTorch nightlies testing and CI runs.

Moving this from #239 to a branch on upstream, following the testing process outlined at https://github.com/vllm-project/ci-infra?tab=readme-ov-file#how-to-test-changes-in-this-repo

Testing in progress:

  1. Baseline (my vllm fork matching HEAD, no ci-infra changes) at https://buildkite.com/vllm/ci/builds/42874/steps/canvas. Allowed 5 test runs to move forward. -> Seems like the PT nightlies build itself failed installing flashinfer, so all tests failed afterwards.
  2. Delta (my vllm fork matching HEAD, plus these ci-infra changes) at https://buildkite.com/vllm/ci/builds/42927/steps/canvas. A few tests to check. -> Command issue with the PT install.

After all this lands we can remove https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.nightly_torch, but that's not urgent.

cc @huydhn @atalman @yangw-dev @khluu

@orionr changed the title from "[PT nightlies] Remove nightly_torch Docker image and build" to "[WIP][PT nightlies] Remove nightly_torch Docker image and build" on Dec 10, 2025
@orionr (Collaborator, Author) commented Dec 10, 2025

@khluu I might need your help on this one and/or have you point me to an expert on Buildkite configs.

I'm trying to use the standard Docker builds here for PyTorch nightly testing, but I also need to run uv pip install torch torchvision torchaudio --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu128 (or add a Docker image layer) before each test runs. I thought I'd figured this out by adding an extra commands section, but it looks like that might need to be propagated down to and through render_cuda_config. Is that the right way to do this, or should I go a different path?
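
Concretely, each test step would need roughly this in front of its normal command (rough sketch, not what the pipeline does today; cu128 is just the index I'm targeting right now):

# Overwrite the release wheels baked into the standard CI image with the
# current PyTorch nightlies; the step's normal test command then runs unchanged.
uv pip install --pre torch torchvision torchaudio \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128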

Current status is that the main Docker image is used (which is good), but all tests are running on the release PyTorch version (not good), without the latest changes.

Latest failing run is at https://buildkite.com/vllm/ci/builds/42927/steps/canvas?sid=019b0a30-bee1-4b6b-8393-7f85b537d2ef with the error


[2025-12-10T22:21:19Z] public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:2dcbac9077ecadff0aa78b7c282f9e147a260e86
Error: Can't use both a step level command and the command parameter of the plugin

because of e596c0d#diff-b5c060fa4acd68fd48a2b3cdcd4069bd9eae5b0ee8512e1b25d8f8e2526834e5R480

Any thoughts? cc @atalman as well and I'll keep digging.
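
One idea I haven't validated (sketch only, assuming the docker plugin's command can be given as a single shell string): fold the nightly install into the existing plugin command instead of adding a separate step-level command, something like

# Hypothetical merged command; "<existing test command>" stands in for whatever the step runs today.
bash -c "uv pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu128 && <existing test command>"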

@huydhn (Collaborator) commented Dec 10, 2025

> I'm trying to use the standard Docker builds here for PyTorch nightly testing, but need to also run uv pip install torch torchvision torchaudio --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu128 (or create a Docker image layer) before each test runs. [...] Is that the right way to do this or should I go a different path?

I think the uv pip install torch --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu12x could only be done as a Docker layer inside https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile#L139-L143. Something like:

if [ "$NIGHTLY" = "1" ]; then
    uv pip install torch --pre --extra-index-url ${PYTORCH_CUDA_NIGHTLY_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
    python use_existing_torch.py
else
    uv pip install -r requirements/cuda.txt --extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
fi
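
For reference, the cu$(...) piece above just turns the CUDA version into the index directory name, assuming the nightly index uses the same cuNNN naming as the release index:

# e.g. with CUDA_VERSION=12.8.1
echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.'   # prints 128, giving .../cu128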

@orionr (Collaborator, Author) commented Dec 10, 2025

Good call on needing build as well as test signal. Let me see what I can do to modify the base Dockerfile.

@orionr force-pushed the orionr/pt-nightlies branch from 5424fa5 to 55368b3 on December 20, 2025 16:27
@orionr changed the title from "[WIP][PT nightlies] Remove nightly_torch Docker image and build" to "[PT nightlies] Remove nightly_torch Docker image and build" on Dec 20, 2025