@micah-wil micah-wil commented Nov 19, 2025

Two tests are currently failing for similar reasons: "Weight Loading Multiple GPU Test" and "Weight Loading Multiple GPU Test - Large Models." The models loaded for these tests are listed in tests/weight_loading/models.txt and tests/weight_loading/models-large.txt. Both files include entries for quantization types that are not supported on ROCm platforms, which causes the tests to fail in AMD CI. I propose adding separate model-list files for AMD CI. That leaves the coverage of the non-ROCm tests unchanged and lets us add ROCm-specific models (e.g. AMD Quark quantized models) to these tests if we want.

Another approach would be to change the test itself to skip certain models when the current platform is ROCm, but that seems less robust to me. Please let me know what you think. Thank you.
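For comparison, the alternative of skipping models inside the test could look roughly like the sketch below. It assumes each line of the model list has the form `quantization, model, revision`, which is an assumption about the file format. The names `ROCM_UNSUPPORTED`, `parse_model_lines`, and `select_models` are hypothetical, and the quantization names in the set are placeholders, not a statement of what ROCm actually supports.

```python
from typing import FrozenSet, Iterable, List, Tuple

ModelEntry = Tuple[str, str, str]  # (quantization, model, revision)

# Placeholder names only; a real list would have to come from the
# platform's actual supported-quantization information.
ROCM_UNSUPPORTED: FrozenSet[str] = frozenset({"example_quant_a", "example_quant_b"})


def parse_model_lines(lines: Iterable[str]) -> List[ModelEntry]:
    """Parse 'quantization, model, revision' lines, skipping blanks and # comments."""
    entries: List[ModelEntry] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            raise ValueError(f"expected 'quantization, model, revision': {raw!r}")
        entries.append((parts[0], parts[1], parts[2]))
    return entries


def select_models(
    entries: Iterable[ModelEntry],
    is_rocm: bool,
    unsupported: FrozenSet[str] = ROCM_UNSUPPORTED,
) -> List[ModelEntry]:
    """Return all entries off ROCm; on ROCm, drop unsupported quantizations."""
    if not is_rocm:
        return list(entries)
    return [e for e in entries if e[0] not in unsupported]
```

The drawback is visible in the sketch: the skip list lives in test code and has to be kept in sync with the shared model files, whereas separate AMD files keep the platform difference in one obvious place.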

@mergify mergify bot added the ci/build and rocm (Related to AMD ROCm) labels Nov 19, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses test failures on ROCm platforms for weight loading tests by introducing separate model configuration files for AMD CI (models-amd.txt and models-large-amd.txt). This is a good approach to isolate platform-specific test configurations. The changes in the Buildkite pipeline configuration are correct, including the removal of an erroneous gpu: a100 specification for an AMD agent. I've suggested a couple of changes to the new model files to improve test suite correctness and follow file format best practices.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 1328 to 1330
  commands:
- - bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models.txt
+ - bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-amd.txt

P1: AMD mirror still uses incompatible weight list

The AMD-specific configs are only hooked up here, but the same weight-loading steps in .buildkite/test-pipeline.yaml (lines 1182-1204) still mirror onto amdexperimental hardware and continue to point at weight_loading/models*.txt, which contain the ROCm-incompatible quantizations this change is trying to avoid. Those mirrored jobs will keep failing on ROCm even after this patch unless they also switch to the new AMD model lists.
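A check for this kind of leftover could be sketched as follows. It assumes the pipeline has already been parsed into a list of step dictionaries with `label`, `mirror_hardwares`, and `commands` keys; that shape is an assumption for illustration, and `stale_amd_steps` is a hypothetical helper, not part of the repository.

```python
import re
from typing import Any, Dict, List

# Matches weight_loading/models.txt and weight_loading/models-large.txt,
# but not the -amd variants (models-amd.txt, models-large-amd.txt).
GENERIC_LIST = re.compile(r"weight_loading/models(-large)?\.txt")


def stale_amd_steps(
    steps: List[Dict[str, Any]], amd_tag: str = "amdexperimental"
) -> List[str]:
    """Return labels of steps mirrored onto AMD that still use a generic model list."""
    flagged: List[str] = []
    for step in steps:
        # Only steps that mirror onto the AMD hardware tag are of interest.
        if amd_tag not in step.get("mirror_hardwares", []):
            continue
        if any(GENERIC_LIST.search(cmd) for cmd in step.get("commands", [])):
            flagged.append(step.get("label", "<unlabeled>"))
    return flagged
```

Running such a check over both pipeline files would list every mirrored weight-loading step that still needs to point at an AMD model list.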
