[ROCm][CI] Fix Weight Loading With Multiple GPU Tests on ROCm #28984
Signed-off-by: Micah Williamson <[email protected]>
Code Review
This pull request addresses test failures on ROCm platforms for weight loading tests by introducing separate model configuration files for AMD CI (models-amd.txt and models-large-amd.txt). This is a good approach to isolate platform-specific test configurations. The changes in the Buildkite pipeline configuration are correct, including the removal of an erroneous gpu: a100 specification for an AMD agent. I've suggested a couple of improvements for the new model files to improve test suite correctness and adhere to file format best practices.
💡 Codex Review
Here are some automated review suggestions for this pull request.
commands:
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models.txt
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-amd.txt
AMD mirror still uses incompatible weight list
The AMD-specific configs are only hooked up here, but the same weight-loading steps in .buildkite/test-pipeline.yaml (lines 1182-1204) still mirror onto amdexperimental hardware and continue to point at weight_loading/models*.txt, which contain the ROCm-incompatible quantizations this change is trying to avoid. Those mirrored jobs will keep failing on ROCm even after this patch unless they also switch to the new AMD model lists.
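One way to spot leftover references like this is to scan the text of the AMD-mirrored steps for weight-loading commands whose -c argument is not an AMD list. The following is only a rough sketch: the function name is hypothetical, it assumes AMD lists are the ones ending in -amd.txt (as in this PR), and it does not determine which steps are mirrored onto AMD hardware, so it should be fed only the relevant steps.

```python
import shlex


def non_amd_weight_loading_configs(pipeline_text):
    """Return -c config paths that do not end in '-amd.txt'.

    Hypothetical helper, not part of the repository. It looks only at
    lines that invoke run_model_weight_loading_test.sh.
    """
    offenders = []
    for raw in pipeline_text.splitlines():
        # Drop indentation and the leading YAML list dash, if any.
        line = raw.strip().lstrip("-").strip()
        if "run_model_weight_loading_test.sh" not in line:
            continue
        tokens = shlex.split(line)
        # Pair each token with its successor to find the value after -c.
        for flag, value in zip(tokens, tokens[1:]):
            if flag == "-c" and not value.endswith("-amd.txt"):
                offenders.append(value)
    return offenders


if __name__ == "__main__":
    sample = (
        "commands:\n"
        "  - bash weight_loading/run_model_weight_loading_test.sh"
        " -c weight_loading/models.txt\n"
    )
    print(non_amd_weight_loading_configs(sample))
```

An empty result for the mirrored steps would suggest they all point at the AMD lists; any returned path is a candidate for switching.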
Two tests are currently failing for similar reasons: "Weight Loading Multiple GPU Test" and "Weight Loading Multiple GPU Test - Large Models." The models loaded for these tests are listed in tests/weight_loading/models.txt and tests/weight_loading/models-large.txt, and both files include tests for quantization types that are not supported on ROCm platforms, which causes the tests to fail in AMD CI. I propose that we add separate files from which we load models for AMD CI, so that we do not impact the coverage of the non-ROCm tests and can add ROCm-specific models (e.g. AMD Quark quantized models) to these tests if we want.
Another approach would be to change the test itself to skip certain models if the current platform is ROCm, but that seems less robust to me. Please let me know what you think, thank you.
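For comparison, the skip-based alternative could look roughly like the sketch below. It is an illustration under assumptions, not the test's actual code: it assumes each line of a model list is a comma-separated record whose first field names the quantization method (the file format is not shown in this PR), and the function name and the quantization names in the example are placeholders.

```python
def filter_models(lines, unsupported_quantizations):
    """Keep only model entries whose quantization is not in the skip set.

    Hypothetical helper. Assumes each entry's first comma-separated
    field is the quantization method; blank lines and lines starting
    with '#' are ignored.
    """
    kept = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        quantization = line.split(",")[0].strip()
        if quantization not in unsupported_quantizations:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Placeholder entries, not taken from the real model lists.
    entries = [
        "example-quant-a, org/model-a, main",
        "example-quant-b, org/model-b, main",
    ]
    print(filter_models(entries, {"example-quant-a"}))
```

A skip set like this would have to be maintained in the test code and kept in sync with what ROCm supports, which is part of why separate AMD model files seem more robust: they also leave room for ROCm-only entries.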