-
Notifications
You must be signed in to change notification settings - Fork 899
Update gcp gpu image #6417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Update gcp gpu image #6417
Conversation
|
/smoke-test --gcp |
|
/smoke-test --gcp -k test_minimal |
sky/catalog/gcp_catalog.py
Outdated
| _image_df = common.read_catalog('gcp/images.csv', | ||
| pull_frequency_hours=0) | ||
| image_id = common.get_image_id_from_tag_impl(_image_df, tag, region) | ||
| if tag == 'skypilot:custom-cpu-ubuntu-2204': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These test lines were added for testing and will be removed after the test passes before the PR is merged.
|
/smoke-test --gcp |
|
The failure also fail on master, not related to this PR. |
|
All tests pass. Could u help take a look on this PR and the catalog change? @Michaelvll |
|
Or @aylei Could u help take a look? |
|
/smoke-test --gcp |
Michaelvll
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zpoint! Could you help run a few recent LLM examples we have to make sure the cuda driver does not break any modern frameworks (may need to update the versions of those framework in the examples to the latest). Also, we should include the tests for those framework in our smoke tests as well.:
https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1-distilled
https://github.com/skypilot-org/skypilot/blob/master/llm/sglang/llama2.yaml
|
/smoke-test --gcp -k test_deepseek_r1_vllm |
|
/smoke-test --gcp -k test_sglang_llama2_serving |
|
/smoke-test --gcp -k test_deepseek_r1_vllm |
|
There're so many test cases in example, added 2 cases. Create an issue as follow up to add rest. |
|
/smoke-test --gcp -k test_deepseek_r1_vllm |
|
/smoke-test --aws -k test_deepseek_r1_vllm |
|
/smoke-test --azure -k test_deepseek_r1_vllm |
|
/smoke-test --azure -k test_deepseek_r1_vllm |
|
/smoke-test --gcp -k test_deepseek_r1_vllm |
|
/smoke-test --gcp |
|
/smoke-test --gcp |
|
The failure tests re not related, see #6669 |
|
@Michaelvll Can we proceed with the review && merge? The GCP test result looks good. |
|
cc'ing @cg505 for a final look |
Michaelvll
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay @zpoint! Thanks for adding this! Please find the comments below : )
|
|
||
| # Download architecture-specific CUDA keyring package | ||
| # CRITICAL FIX: Install GCC 12 for kernel 6.8+ compatibility | ||
| # GCP's kernel 6.8.0-1033-gcp was built with GCC 12, but NVIDIA drivers expected GCC 11 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit confused by the commands, if nvidia driver expects GCC 11, why we are setting GCC 12 as the default compiler to use? Also, can we just use an older GCP base ubuntu image instead, e.g. just use the version that was used for building the previous images? Manually installing GCC needs extra care.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comments were written when testing several different GCC versions, and only GCC12 works and is kept.
I'm trying to reuse the cuda.sh to make as few changes as possible now.
| echo "=== Setting up CUDA environment ===" | ||
| # Add CUDA to system-wide profile | ||
| echo 'export PATH="/usr/local/cuda/bin:$PATH"' | sudo tee -a /etc/profile | ||
| echo 'export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"' | sudo tee -a /etc/profile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this was not needed previously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth adding /sbin in the path as well? https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/qwen3-235b.yaml#L30
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's also because of the custom CUDA and GCC version installation.
|
We need the catalog from skypilot-org/skypilot-catalog#150 to be merged and updated so CI can pass. |
|
/smoke-test --gcp |
d2d3ece to
b12166c
Compare
Follow up on #6191
Build new gcp's CPU and GPU image.
skypilot-org/skypilot-catalog#150
Above PR updates the catalog repo, which is required to make it work.
The failure also fail on master, not related to this PR.
Old base image doesn't work without upgrading GCC
Tried to revert the base image to
ubuntu-2204-jammy-v20240927but failed @Michaelvll :Before this PR:
After updated catalog to new image:
Tested (run the relevant ones):
/smoke-test --gcp(CI) orpytest tests/test_smoke.py(local)