[cuda 13][aarch64][CI] Adding CI steps to build arm64 cuda13 nightly wheels and images #28983
Conversation
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which starts only a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Signed-off-by: Shang Wang <[email protected]>
Force-pushed from 6c7ce36 to 6ad4052.
Code Review
This pull request adds CI steps to build aarch64 wheels and images for CUDA 13.0. The changes introduce two new jobs to the Buildkite release pipeline. My review has identified a critical issue where the builds will likely fail due to a hardcoded PyTorch version for an older CUDA version in the Dockerfile. Additionally, I've pointed out a high-severity concern regarding a change in the base build image to a newer Ubuntu version, which could impact the binary compatibility of the generated artifacts. Both issues are present in the two new CI steps.
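For orientation, the two new jobs follow the usual shape of steps in this pipeline. A minimal sketch of one of them is below; the label and overall step layout are illustrative assumptions, while the queue name and build command are taken from the hunks that follow:

```yaml
# Illustrative Buildkite step shape; see the actual hunks below for the full commands.
- label: "Build arm64 CUDA 13.0 wheel"   # label is an assumption, not from the PR
  agents:
    queue: arm64_cpu_queue_postmerge     # queue name taken from the diff
  commands:
    - "DOCKER_BUILDKIT=1 docker build --build-arg CUDA_VERSION=13.0.1 --build-arg VLLM_MAIN_CUDA_VERSION=13.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
```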
```yaml
commands:
  # #NOTE: torch_cuda_arch_list is derived from upstream PyTorch build files here:
  # https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/aarch64_ci_build.sh#L7
  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg VLLM_MAIN_CUDA_VERSION=13.0 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
```
Setting CUDA_VERSION=13.0.1 for this arm64 build will likely cause it to fail. The docker/Dockerfile has a hardcoded PyTorch version for CUDA 12.8 (torch==2.8.0.dev20250318+cu128) for arm64 platforms (see docker/Dockerfile lines 344-352). The build process will attempt to find this cu128 package in the cu130 PyTorch index, which will not work. To fix this, the hardcoded PyTorch version in docker/Dockerfile needs to be updated or made dynamic to support CUDA 13.0.
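One way to make the pin dynamic, sketched below: derive the cuXXX index tag from CUDA_VERSION instead of hardcoding a +cu128 wheel. This is a minimal illustration, not the actual docker/Dockerfile logic; TARGETPLATFORM is the standard BuildKit platform arg, the index URL follows the usual PyTorch nightly pattern, and the base image is assumed to provide pip:

```dockerfile
FROM nvidia/cuda:13.0.1-devel-ubuntu22.04  # illustrative base; assumes python3/pip available
ARG CUDA_VERSION=13.0.1
ARG TARGETPLATFORM
# Sketch only: map CUDA_VERSION=13.0.1 -> cu130 and resolve torch against the
# matching nightly index rather than pinning a cu128 wheel.
RUN if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
        cu_tag="cu$(echo "$CUDA_VERSION" | cut -d. -f1,2 | tr -d .)"; \
        pip install --pre torch \
            --index-url "https://download.pytorch.org/whl/nightly/${cu_tag}"; \
    fi
```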
```yaml
queue: arm64_cpu_queue_postmerge
commands:
  - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cuda13.0 --target vllm-openai --progress plain -f docker/Dockerfile ."
```
Similar to the wheel build step, setting CUDA_VERSION=13.0.1 for this arm64 build will likely cause a failure. The docker/Dockerfile uses a hardcoded PyTorch version for CUDA 12.8 (torch==2.8.0.dev20250318+cu128) for arm64 platforms (lines 344-352), which is incompatible with the cu130 index that will be used. The hardcoded version in docker/Dockerfile needs to be adjusted for CUDA 13.0 support.
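Before pinning a replacement version, it may be worth confirming what the cu130 nightly index actually serves. A quick check using a standard pip command; the URL follows the usual PyTorch nightly index pattern:

```bash
# List the torch nightlies available on the cu130 index for the current
# platform (run on an aarch64 host to see the aarch64 wheels).
pip index versions torch --pre \
    --index-url https://download.pytorch.org/whl/nightly/cu130
```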
```yaml
commands:
  # #NOTE: torch_cuda_arch_list is derived from upstream PyTorch build files here:
  # https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/aarch64_ci_build.sh#L7
  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg VLLM_MAIN_CUDA_VERSION=13.0 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
```
The BUILD_BASE_IMAGE is set to nvidia/cuda:13.0.1-devel-ubuntu22.04. This contradicts the project's stated goal of using an older Ubuntu version for builds to maintain broad glibc compatibility, as mentioned in docker/Dockerfile (lines 18-21). Using ubuntu22.04 may limit the portability of the generated wheel. Other arm64 builds in this pipeline use the default ubuntu20.04-based image. If this change is not intentional, consider removing the --build-arg BUILD_BASE_IMAGE to use the default.
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg VLLM_MAIN_CUDA_VERSION=13.0 --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."| queue: arm64_cpu_queue_postmerge | ||
| commands: | ||
| - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7" | ||
| - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cuda13.0 --target vllm-openai --progress plain -f docker/Dockerfile ." |
The BUILD_BASE_IMAGE is set to an ubuntu22.04-based image, which may reduce the glibc compatibility of the resulting Docker image and the artifacts within. This is inconsistent with the project's documented approach in docker/Dockerfile (lines 18-21) and other arm64 builds in this file. Please consider removing the --build-arg BUILD_BASE_IMAGE argument if using ubuntu22.04 is not a strict requirement for CUDA 13.0.
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)-cuda13.0 --target vllm-openai --progress plain -f docker/Dockerfile ."
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.