[cuda 13][aarch64][CI] Adding CI steps to build arm64 cuda13 nightly wheels and images #28983
Conversation
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which starts only a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Signed-off-by: Shang Wang <[email protected]>
Force-pushed from 6c7ce36 to 6ad4052.
Code Review
This pull request adds CI steps to build aarch64 wheels and images for CUDA 13.0. The changes introduce two new jobs to the Buildkite release pipeline. My review has identified a critical issue where the builds will likely fail due to a hardcoded PyTorch version for an older CUDA version in the Dockerfile. Additionally, I've pointed out a high-severity concern regarding a change in the base build image to a newer Ubuntu version, which could impact the binary compatibility of the generated artifacts. Both issues are present in the two new CI steps.
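For orientation, the two new jobs follow the usual shape of steps in this pipeline. A minimal sketch of one of them is below; the label and overall step layout are illustrative assumptions, while the queue name and build command are taken from the hunks that follow:

```yaml
# Illustrative Buildkite step shape; see the actual hunks below for the full commands.
- label: "Build arm64 CUDA 13.0 wheel"   # label is an assumption, not from the PR
  agents:
    queue: arm64_cpu_queue_postmerge     # queue name taken from the diff
  commands:
    - "DOCKER_BUILDKIT=1 docker build --build-arg CUDA_VERSION=13.0.1 --build-arg VLLM_MAIN_CUDA_VERSION=13.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
```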
```yaml
commands:
  # #NOTE: torch_cuda_arch_list is derived from upstream PyTorch build files here:
  # https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/aarch64_ci_build.sh#L7
  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg VLLM_MAIN_CUDA_VERSION=13.0 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
```
Setting CUDA_VERSION=13.0.1 for this arm64 build will likely cause it to fail. The docker/Dockerfile has a hardcoded PyTorch version for CUDA 12.8 (torch==2.8.0.dev20250318+cu128) for arm64 platforms (see docker/Dockerfile lines 344-352). The build process will attempt to find this cu128 package in the cu130 PyTorch index, which will not work. To fix this, the hardcoded PyTorch version in docker/Dockerfile needs to be updated or made dynamic to support CUDA 13.0.
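One way to make the pin dynamic, sketched below: derive the cuXXX index tag from CUDA_VERSION instead of hardcoding a +cu128 wheel. This is a minimal illustration, not the actual docker/Dockerfile logic; TARGETPLATFORM is the standard BuildKit platform arg, the index URL follows the usual PyTorch nightly pattern, and the base image is assumed to provide pip:

```dockerfile
FROM nvidia/cuda:13.0.1-devel-ubuntu22.04  # illustrative base; assumes python3/pip available
ARG CUDA_VERSION=13.0.1
ARG TARGETPLATFORM
# Sketch only: map CUDA_VERSION=13.0.1 -> cu130 and resolve torch against the
# matching nightly index rather than pinning a cu128 wheel.
RUN if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
        cu_tag="cu$(echo "$CUDA_VERSION" | cut -d. -f1,2 | tr -d .)"; \
        pip install --pre torch \
            --index-url "https://download.pytorch.org/whl/nightly/${cu_tag}"; \
    fi
```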
```yaml
queue: arm64_cpu_queue_postmerge
commands:
  - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cuda13.0 --target vllm-openai --progress plain -f docker/Dockerfile ."
```
Similar to the wheel build step, setting CUDA_VERSION=13.0.1 for this arm64 build will likely cause a failure. The docker/Dockerfile uses a hardcoded PyTorch version for CUDA 12.8 (torch==2.8.0.dev20250318+cu128) for arm64 platforms (lines 344-352), which is incompatible with the cu130 index that will be used. The hardcoded version in docker/Dockerfile needs to be adjusted for CUDA 13.0 support.
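Before pinning a replacement version, it may be worth confirming what the cu130 nightly index actually serves. A quick check using a standard pip command; the URL follows the usual PyTorch nightly index pattern:

```bash
# List the torch nightlies available on the cu130 index for the current
# platform (run on an aarch64 host to see the aarch64 wheels).
pip index versions torch --pre \
    --index-url https://download.pytorch.org/whl/nightly/cu130
```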
```yaml
commands:
  # #NOTE: torch_cuda_arch_list is derived from upstream PyTorch build files here:
  # https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/aarch64_ci_build.sh#L7
  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg VLLM_MAIN_CUDA_VERSION=13.0 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
```
The BUILD_BASE_IMAGE is set to nvidia/cuda:13.0.1-devel-ubuntu22.04. This contradicts the project's stated goal of using an older Ubuntu version for builds to maintain broad glibc compatibility, as mentioned in docker/Dockerfile (lines 18-21). Using ubuntu22.04 may limit the portability of the generated wheel. Other arm64 builds in this pipeline use the default ubuntu20.04-based image. If this change is not intentional, consider removing the --build-arg BUILD_BASE_IMAGE to use the default.
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg VLLM_MAIN_CUDA_VERSION=13.0 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."| queue: arm64_cpu_queue_postmerge | ||
| commands: | ||
| - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7" | ||
| - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cuda13.0 --target vllm-openai --progress plain -f docker/Dockerfile ." |
The BUILD_BASE_IMAGE is set to an ubuntu22.04-based image, which may reduce the glibc compatibility of the resulting Docker image and the artifacts within. This is inconsistent with the project's documented approach in docker/Dockerfile (lines 18-21) and other arm64 builds in this file. Please consider removing the --build-arg BUILD_BASE_IMAGE argument if using ubuntu22.04 is not a strict requirement for CUDA 13.0.
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cuda13.0 --target vllm-openai --progress plain -f docker/Dockerfile ."
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.