[ROCm][CI] Fix Weight Loading With Multiple GPU Tests on ROCm #28984

micah-wil · 2025-11-19T03:05:21Z

Two tests are currently failing in AMD CI for similar reasons: "Weight Loading Multiple GPU Test" and "Weight Loading Multiple GPU Test - Large Models." The models for these tests are listed in tests/weight_loading/models.txt and tests/weight_loading/models-large.txt, and both files include entries that use quantization types not supported on ROCm platforms, which causes the failures. I propose adding separate model list files for AMD CI. That way the coverage of the non-ROCm tests is unaffected, and we can add ROCm-specific models (e.g. AMD Quark quantized models) to these tests if we want.

Another approach would be to change the test itself to skip certain models when the current platform is ROCm, but that seems less robust to me. Please let me know what you think. Thank you.
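For comparison, here is a minimal sketch of what that alternative could look like. It is not part of this PR. It assumes each non-empty line in the model list has the form "quantization, model_id, revision", and the function names, the `is_rocm` flag, and the quantization names in the unsupported set are all hypothetical placeholders rather than the real ROCm support matrix.

```python
# Hypothetical sketch: skip weight-loading model specs that ROCm cannot run.
# Assumption: each spec line is "quantization, model_id, revision".
# Assumption: the names in this set are placeholders, not real quantization types.
ROCM_UNSUPPORTED_QUANTIZATIONS = {"example_quant_a", "example_quant_b"}


def parse_model_spec(line):
    """Split one spec line into (quantization, model_id, revision)."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated fields, got: {line!r}")
    return tuple(parts)


def select_models(lines, is_rocm, unsupported=ROCM_UNSUPPORTED_QUANTIZATIONS):
    """Return the specs to test, dropping unsupported ones when is_rocm is True."""
    selected = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        spec = parse_model_spec(line)
        if is_rocm and spec[0] in unsupported:
            continue  # skipped only on ROCm; other platforms keep full coverage
        selected.append(spec)
    return selected
```

The drawback is visible in the sketch: the unsupported set lives in test code and has to be kept in sync by hand, whereas separate model list files keep the platform split in one reviewable place.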

Signed-off-by: Micah Williamson <[email protected]>

gemini-code-assist

Code Review

This pull request addresses test failures on ROCm platforms for weight loading tests by introducing separate model configuration files for AMD CI (models-amd.txt and models-large-amd.txt). This is a good approach to isolate platform-specific test configurations. The changes in the Buildkite pipeline configuration are correct, including the removal of an erroneous gpu: a100 specification for an AMD agent. I've suggested a couple of changes to the new model files to improve test suite correctness and follow file format best practices.

tests/weight_loading/models-amd.txt

tests/weight_loading/models-large-amd.txt

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

chatgpt-codex-connector · 2025-11-19T03:11:43Z

.buildkite/test-amd.yaml

  commands:
-    - bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models.txt
+    - bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-amd.txt



AMD mirror still uses incompatible weight list

The AMD-specific configs are only hooked up here, but the same weight-loading steps in .buildkite/test-pipeline.yaml (lines 1182-1204) still mirror onto amdexperimental hardware and continue to point at weight_loading/models*.txt, which contain the ROCm-incompatible quantizations this change is trying to avoid. Those mirrored jobs will keep failing on ROCm even after this patch unless they also switch to the new AMD model lists.
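One way to check for leftover references is a plain text scan of the pipeline file. The sketch below is an illustration only: the helper name is hypothetical, and it lists every line that still points at the generic model lists without deciding whether that step is actually mirrored onto AMD hardware, so each hit still needs a manual look.

```python
import re

# Matches the generic lists (models.txt, models-large.txt) but not the
# AMD-specific ones (models-amd.txt, models-large-amd.txt).
GENERIC_LIST = re.compile(r"weight_loading/models(?:-large)?\.txt")


def find_generic_list_references(pipeline_text):
    """Return (line_number, stripped_line) for each generic model list reference."""
    hits = []
    for lineno, line in enumerate(pipeline_text.splitlines(), start=1):
        if GENERIC_LIST.search(line):
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "commands:\n"
        "  - bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models.txt\n"
        "  - bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-amd.txt\n"
    )
    for lineno, line in find_generic_list_references(sample):
        print(f"{lineno}: {line}")
```

In practice the text of .buildkite/test-pipeline.yaml would be read from disk and passed in as `pipeline_text`.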


Add AMD-specific models for weight loading test

acfaa25

Signed-off-by: Micah Williamson <[email protected]>

micah-wil requested review from mgoin, yewentao256 and youkaichao as code owners November 19, 2025 03:05

mergify bot added the ci/build and rocm labels Nov 19, 2025

gemini-code-assist bot reviewed Nov 19, 2025

tests/weight_loading/models-amd.txt (outdated, resolved)

tests/weight_loading/models-large-amd.txt (outdated, resolved)

chatgpt-codex-connector bot reviewed Nov 19, 2025

micah-wil added 2 commits November 19, 2025 03:17

fix duplicate model

f23a2b6

Signed-off-by: Micah Williamson <[email protected]>

Removing TheBloke/Llama-2-7B-GPTQ from models-amd.txt

13f11d7

Signed-off-by: Micah Williamson <[email protected]>
