@zpoint
Collaborator

@zpoint zpoint commented Jul 29, 2025

Follow-up to #6191.

Builds new CPU and GPU images for GCP.

skypilot-org/skypilot-catalog#150

The PR above updates the catalog repo, which is required for this change to work.

The failing test also fails on master, so it is not related to this PR.

The old base image doesn't work without upgrading GCC.

I tried reverting the base image to ubuntu-2204-jammy-v20240927, but it failed, @Michaelvll:

Line 859-860:
What happened:

  1. The base image ubuntu-2204-jammy-v20240927 started with an older kernel
  2. During the provisioning process, something installed the linux-gcp package
  3. This automatically pulled in the latest GCP kernel 6.8.0-1015-gcp which was compiled with GCC 12
  4. The NVIDIA driver then tried to compile against this newer kernel and failed

The root cause is that GCP's guest environment automatically installs the latest linux-gcp kernel during system initialization, and this kernel was compiled with GCC 12. The hardcoded base image approach won't work because the kernel gets updated during the provisioning process.
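The mismatch described above can be checked on an instance before the driver build runs. The following is a minimal sketch, not code from this PR: the helper names are hypothetical, and the sample banner is an illustrative string in the usual `/proc/version` layout rather than output captured from these images.

```python
import re
from typing import Optional

# Illustrative banner in the usual /proc/version layout (an assumption,
# not output captured from the images built in this PR).
SAMPLE_PROC_VERSION = (
    "Linux version 6.8.0-1015-gcp (buildd@example) "
    "(x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0, "
    "GNU ld (GNU Binutils for Ubuntu) 2.38) #17~22.04.1-Ubuntu SMP"
)


def kernel_gcc_major(proc_version: str) -> Optional[int]:
    """Return the major version of the GCC that built the kernel, if found."""
    # Take the first x.y.z version number that appears after the word "gcc".
    match = re.search(r"gcc.*?(\d+)\.\d+\.\d+", proc_version)
    return int(match.group(1)) if match else None


def installed_gcc_major(gcc_dumpversion: str) -> Optional[int]:
    """Parse the major version from the output of `gcc -dumpversion`."""
    match = re.match(r"\s*(\d+)", gcc_dumpversion)
    return int(match.group(1)) if match else None


def compiler_matches_kernel(proc_version: str, gcc_dumpversion: str) -> bool:
    """True when the default gcc has the same major version as the kernel's compiler."""
    kernel = kernel_gcc_major(proc_version)
    installed = installed_gcc_major(gcc_dumpversion)
    return kernel is not None and kernel == installed
```

On a real instance, the two inputs would come from reading `/proc/version` and running `gcc -dumpversion`. A `False` result corresponds to the situation described here, where the kernel was built with GCC 12 but the default compiler is older.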

Before this PR:

(sky) ➜  skypilot git:(master) sky launch --infra gcp 'ls /home' 
Command to run: ls /home
Considered resources (1 node):
------------------------------------------------------------------------------------
 INFRA                 INSTANCE        vCPUs   Mem(GB)   GPUS   COST ($)   CHOSEN   
------------------------------------------------------------------------------------
 GCP (us-central1-a)   n4-standard-8   8       32        -      0.38          ✔     
------------------------------------------------------------------------------------
Launching a new cluster 'sky-f306-zepingguo'. Proceed? [Y/n]: Y
⚙︎ Launching on GCP us-central1 (us-central1-a).
└── Instance is up.
✓ Cluster launched: sky-f306-zepingguo.  View logs: sky api logs -l sky-2025-07-24-22-31-20-663541/provision.log
⚙︎ Syncing files.
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-cmd, pid=3143) adithyansujithkumar
(sky-cmd, pid=3143) cooperc
(sky-cmd, pid=3143) danz
(sky-cmd, pid=3143) gcpuser
(sky-cmd, pid=3143) seungjinyang
(sky-cmd, pid=3143) ubuntu
(sky-cmd, pid=3143) yikaluo
✓ Job finished (status: SUCCEEDED).

After updating the catalog to the new image:

(sky) ➜  skypilot git:(master) sky launch --infra gcp 'ls /home'
Command to run: ls /home
Considered resources (1 node):
------------------------------------------------------------------------------------
 INFRA                 INSTANCE        vCPUs   Mem(GB)   GPUS   COST ($)   CHOSEN   
------------------------------------------------------------------------------------
 GCP (us-central1-a)   n4-standard-8   8       32        -      0.38          ✔     
------------------------------------------------------------------------------------
Launching a new cluster 'sky-e6da-zepingguo'. Proceed? [Y/n]: Y
⚙︎ Launching on GCP us-central1 (us-central1-a).
└── Instance is up.
✓ Cluster launched: sky-e6da-zepingguo.  View logs: sky api logs -l sky-2025-07-24-22-47-00-597721/provision.log
⚙︎ Syncing files.
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-cmd, pid=2988) gcpuser
(sky-cmd, pid=2988) ubuntu
✓ Job finished (status: SUCCEEDED).

Tested (run the relevant ones):

  • All smoke tests: /smoke-test --gcp (CI) or pytest tests/test_smoke.py (local)

@zpoint zpoint requested a review from Michaelvll July 29, 2025 11:06

zpoint commented Jul 30, 2025

/smoke-test --gcp

zpoint commented Jul 30, 2025

/smoke-test --gcp -k test_minimal

_image_df = common.read_catalog('gcp/images.csv',
                                pull_frequency_hours=0)
image_id = common.get_image_id_from_tag_impl(_image_df, tag, region)
if tag == 'skypilot:custom-cpu-ubuntu-2204':
Collaborator Author

These lines were added only for testing. They will be removed once the tests pass, before the PR is merged.

zpoint commented Jul 30, 2025

/smoke-test --gcp

zpoint commented Jul 31, 2025

The failing test also fails on master, so it is not related to this PR.

@zpoint zpoint requested a review from aylei July 31, 2025 03:28

zpoint commented Jul 31, 2025

All tests pass. Could you take a look at this PR and the catalog change, @Michaelvll?

zpoint commented Jul 31, 2025

Or @aylei, could you take a look?

zpoint commented Aug 13, 2025

/smoke-test --gcp

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @zpoint! Could you run a few of our recent LLM examples to make sure the CUDA driver does not break any modern frameworks? (You may need to update the framework versions in the examples to the latest.) We should also add tests for those frameworks to our smoke tests:
https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1-distilled
https://github.com/skypilot-org/skypilot/blob/master/llm/sglang/llama2.yaml

zpoint commented Aug 13, 2025

/smoke-test --gcp -k test_deepseek_r1_vllm

zpoint commented Aug 13, 2025

/smoke-test --gcp -k test_sglang_llama2_serving

zpoint commented Aug 13, 2025

/smoke-test --gcp -k test_deepseek_r1_vllm
/smoke-test --gcp -k test_sglang_llava_serving

zpoint commented Aug 13, 2025

There are many test cases in the examples, so I added two for now. Creating a follow-up issue to add the rest.

zpoint commented Aug 13, 2025

/smoke-test --gcp -k test_deepseek_r1_vllm
/smoke-test --gcp -k test_sglang_llava_serving

zpoint commented Aug 13, 2025

/smoke-test --aws -k test_deepseek_r1_vllm
/smoke-test --aws -k test_sglang_llava_serving

zpoint commented Aug 13, 2025

/smoke-test --azure -k test_deepseek_r1_vllm
/smoke-test --azure -k test_sglang_llava_serving

zpoint commented Aug 13, 2025

/smoke-test --azure -k test_deepseek_r1_vllm

zpoint commented Aug 13, 2025

/smoke-test --gcp -k test_deepseek_r1_vllm
/smoke-test --gcp -k test_sglang_llava_serving

zpoint commented Aug 13, 2025

/smoke-test --gcp

zpoint commented Aug 14, 2025

/smoke-test --gcp

zpoint commented Aug 14, 2025

The failing tests are not related to this PR; see #6669.

@zpoint zpoint requested a review from Michaelvll August 14, 2025 06:19

zpoint commented Aug 14, 2025

@Michaelvll Can we proceed with the review and merge? The GCP test results look good.

@Michaelvll
Collaborator

cc'ing @cg505 for a final look

@Michaelvll Michaelvll requested a review from cg505 August 20, 2025 03:20
Collaborator

@Michaelvll Michaelvll left a comment

Sorry for the delay @zpoint! Thanks for adding this! Please find the comments below : )


# Download architecture-specific CUDA keyring package
# CRITICAL FIX: Install GCC 12 for kernel 6.8+ compatibility
# GCP's kernel 6.8.0-1033-gcp was built with GCC 12, but NVIDIA drivers expected GCC 11
Collaborator

I am a bit confused by these commands: if the NVIDIA driver expects GCC 11, why are we setting GCC 12 as the default compiler? Also, can we just use an older GCP base Ubuntu image instead, e.g. the version that was used to build the previous images? Manually installing GCC needs extra care.

Collaborator Author

@zpoint zpoint Sep 23, 2025

I think those comments were written while testing several different GCC versions. Only GCC 12 worked, so it was kept.

I'm now trying to reuse cuda.sh to keep the changes as small as possible.

Comment on lines 89 to 92
echo "=== Setting up CUDA environment ==="
# Add CUDA to system-wide profile
echo 'export PATH="/usr/local/cuda/bin:$PATH"' | sudo tee -a /etc/profile
echo 'export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"' | sudo tee -a /etc/profile
Collaborator

Why was this not needed previously?

Collaborator Author

I think that's also because of the custom CUDA and GCC version installation.

zpoint commented Sep 23, 2025

We need the catalog from skypilot-org/skypilot-catalog#150 to be merged and updated so CI can pass.

@zpoint zpoint requested a review from Michaelvll September 23, 2025 14:30

zpoint commented Sep 23, 2025

/smoke-test --gcp

@zpoint zpoint force-pushed the dev/zeping/gcp_images branch from d2d3ece to b12166c Compare September 24, 2025 04:26