Adds the rocm-mpich-base container. #15
Great work!
General comment: I strongly recommend migrating to Dockerfiles ASAP, since they are the de facto standard. We could then generate SIF images right after the Docker build, giving compact, HPC-ready images.
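A minimal sketch of the Docker-then-SIF workflow suggested above, assuming Docker and Singularity (or Apptainer with its `singularity` alias) are available on the build host. The image name, tag, Dockerfile name, and output file name are placeholders chosen for illustration, and the `run` helper with its `DRY_RUN` switch is hypothetical. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder names (assumptions, not taken from the repository).
IMAGE_NAME="rocm-mpich-base"
IMAGE_TAG="latest"
DOCKERFILE="rocm-mpich-base.dockerfile"
SIF_FILE="${IMAGE_NAME}_${IMAGE_TAG}.sif"

# Print the command when DRY_RUN=1 (the default in this sketch), otherwise run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: build the Docker image from the Dockerfile in the current directory.
run docker build -t "${IMAGE_NAME}:${IMAGE_TAG}" -f "${DOCKERFILE}" .

# Step 2: convert the image held by the local Docker daemon into a SIF file.
run singularity build "${SIF_FILE}" "docker-daemon://${IMAGE_NAME}:${IMAGE_TAG}"
```

Setting `DRY_RUN=0` executes the two commands for real. The `docker-daemon://` source reads the image straight from the local Docker daemon, so no registry push is needed in between.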
Both images built successfully.
`import tensorflow` in the TensorFlow image produced the following:
2023-08-28 06:38:58.329453: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Can this be addressed in a time-effective way? If not, it is fine to defer it.
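One low-cost option, sketched here as an assumption rather than a decision made in this PR, is to hide the message instead of rebuilding TensorFlow. The `I` prefix in the log marks it as INFO level, and setting `TF_CPP_MIN_LOG_LEVEL=1` filters INFO messages from TensorFlow's C++ backend while keeping warnings and errors. This does not change which CPU instructions the binary uses.

```shell
#!/usr/bin/env bash

# Hide INFO-level messages from TensorFlow's C++ backend
# (warnings and errors are still shown).
export TF_CPP_MIN_LOG_LEVEL=1

# Only attempt the import when TensorFlow is installed in this environment.
if python3 -c "import tensorflow" >/dev/null 2>&1; then
  python3 -c "import tensorflow as tf; print(tf.__version__)"
else
  echo "tensorflow not available here; TF_CPP_MIN_LOG_LEVEL=${TF_CPP_MIN_LOG_LEVEL}"
fi
```

The same variable could be set in the image's environment so that users do not have to export it themselves. Actually enabling AVX2 and FMA in the remaining operations would require rebuilding TensorFlow from source, as the message says.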
See the comments throughout the code.
@dipietrantonio, would it be possible to split this into three separate pull requests? One for the rocm-mpich-base.dockerfile, one for PyTorch, and a third for TensorFlow? That way, a fix for one will not block the others (if the others are ready). For example, I think we are in a perfect position to merge the rocm one. Indeed, I think this is sort of "urgent", since many other images (ours or user-made) can be built FROM this one. By the way, in that one, maybe we should stick to rocm/5.6 due to driver compatibility. Regarding the PyTorch one, I did not read it carefully, but can that one and the TensorFlow one start FROM the rocm-mpich one using Dockerfiles? I mean:
Cheers.
I am still working on that one, testing a few remaining things. I hope I can push it today to our Pawsey repository. I will update this PR to be just about the
The reason is the driver on Setonix is too old for ROCm 5.7. Also set a few important environment variables.
Marco is no longer part of the project.
@pelahi @AlexisEspinosaGayosso this PR is ready for review.
Is there a reason why this container doesn't build from the mpich lustre container?
I install libfabric as a dependency of https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl, which is needed for optimal performance of the ROCm Communication Collectives Library (RCCL). So this is an AMD-specific build. I might get away with installing libfabric and aws-ofi-rccl on top of the mpich-lustre container. I will try.
- added a readme to describe how to build the container
- updated the recipe to add lustre, to ensure a lustre-aware MPI
- updated the recipe to generalise it a bit, added comments to
I will prepare the equivalent Dockerfiles after experimenting with it some more.