Adds the rocm-mpich-base container. #15

Merged
merged 10 commits into from
Mar 13, 2024

dipietrantonio
Contributor

I will prepare the equivalent docker files after having done more experimenting with it.

Contributor

@marcodelapierre marcodelapierre left a comment

Great work!

General comment: I strongly recommend migrating to Dockerfiles as soon as possible, since they are the de facto standard. We could then generate SIF images right after the Docker build, giving compact, HPC-ready images.

Both images built successfully.

Running import tensorflow in the TensorFlow image produced the following:

2023-08-28 06:38:58.329453: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

Can this be addressed in a time-effective way? Otherwise it is fine to defer it.

See the inline comments on the code.
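
For reference, the message quoted above is logged at INFO level, so it can be hidden without rebuilding TensorFlow. The following is only a minimal sketch: it assumes that filtering the log output is acceptable, it uses the TF_CPP_MIN_LOG_LEVEL environment variable (which has to be set before TensorFlow is imported), and the helper name quiet_tensorflow_info_logs is hypothetical. It does not enable AVX2 or FMA; that would still require rebuilding TensorFlow with the appropriate compiler flags, as the message says.

```python
import importlib.util
import os


def quiet_tensorflow_info_logs(level: str = "1") -> str:
    """Set TF_CPP_MIN_LOG_LEVEL unless the caller already chose a value.

    "1" filters out INFO messages from TensorFlow's native logging,
    "2" also filters warnings, and "3" also filters errors.
    Hypothetical helper name, used only for this illustration.
    """
    os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", level)
    return os.environ["TF_CPP_MIN_LOG_LEVEL"]


# The variable must be set before TensorFlow is imported.
chosen_level = quiet_tensorflow_info_logs()

# Import TensorFlow only if it is installed, so the sketch also runs elsewhere.
if importlib.util.find_spec("tensorflow") is not None:
    import tensorflow as tf

    print(tf.__version__)
```

The same effect could presumably be obtained by exporting the variable in the container environment, which keeps user scripts unchanged.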

@AlexisEspinosaGayosso
Collaborator

AlexisEspinosaGayosso commented Oct 31, 2023

@dipietrantonio, would it be possible to separate this as three different pull requests? One for rocm-mpich-base.dockerfile, one for PyTorch, and a third for TensorFlow? That way, a fix for one will not block the others (if the others are ready).

For example, I think we are in a perfect position to pull the rocm one. Indeed, I think this is somewhat urgent, as many other images (ours or user-made) can be built FROM this one. By the way, for that one, maybe we should stick to ROCm 5.6 due to driver compatibility.

Regarding the PyTorch one, I did not read it carefully, but can it and the TensorFlow one start FROM the rocm-mpich image using Dockerfiles? I mean:

FROM rocm-mpich-base:3.4.3_ubuntu22.04

Cheers.

@dipietrantonio
Contributor Author

Hi @AlexisEspinosaGayosso,

> would it be possible to separate this as three different pull requests?

Yes, I will do that; it makes more sense.

> For example, I think we are in a perfect position to pull the rocm one.

I am still working on that one, testing a few remaining things. I hope I can push it today to our Pawsey repository.

I will update this PR to be just about the rocm-mpich-base. I think the appropriate name should be ubuntu:22-rocm5.6.0-mpich3.4.3 as this is mainly a base Ubuntu image, containing a few needed libraries. We should have a discussion about naming conventions.

The reason for staying on ROCm 5.6 is that the driver on Setonix is too old for ROCm 5.7.

Also set a few important environment variables.
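
As a starting point for the naming discussion, the scheme proposed above (ubuntu:22-rocm5.6.0-mpich3.4.3) could be sketched as follows. This is only an illustration under the assumption that a tag is the base OS version followed by rocm and mpich components separated by hyphens; the function names image_tag and parse_image_tag are hypothetical and not part of any existing tooling in this repository.

```python
def image_tag(base_os: str, os_version: str, rocm: str, mpich: str) -> str:
    """Compose a tag such as ubuntu:22-rocm5.6.0-mpich3.4.3 (assumed scheme)."""
    return f"{base_os}:{os_version}-rocm{rocm}-mpich{mpich}"


def parse_image_tag(tag: str) -> dict:
    """Split a tag built with image_tag back into its components."""
    base_os, sep, rest = tag.partition(":")
    parts = rest.split("-")
    if not sep or len(parts) != 3:
        raise ValueError(f"unexpected tag layout: {tag!r}")
    os_version, rocm_part, mpich_part = parts
    if not rocm_part.startswith("rocm") or not mpich_part.startswith("mpich"):
        raise ValueError(f"missing rocm/mpich component in: {tag!r}")
    return {
        "base_os": base_os,
        "os_version": os_version,
        "rocm": rocm_part[len("rocm"):],
        "mpich": mpich_part[len("mpich"):],
    }


tag = image_tag("ubuntu", "22", "5.6.0", "3.4.3")
print(tag)
print(parse_image_tag(tag))
```

Keeping every version in the tag would make constraints such as the ROCm 5.6 driver requirement visible from the image name alone.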

dipietrantonio changed the title from "Adds rocm-mpich and rocm-tensorflow containers." to "Adds the rocm-mpich-base container." on Nov 1, 2023

dipietrantonio dismissed marcodelapierre’s stale review on November 1, 2023 09:00

Marco is no longer part of the project.

@dipietrantonio
Contributor Author

@pelahi @AlexisEspinosaGayosso this PR is ready for review.

Collaborator

@pelahi pelahi left a comment

Is there a reason why this container doesn't build from the mpich-lustre container?

pelahi added the enhancement label on Dec 13, 2023

@dipietrantonio
Contributor Author

I install libfabric as a dependency of https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl, which is needed for optimal performance of the ROCm Communication Collectives Library (RCCL). So this is an AMD-specific build. I might get away with installing libfabric and aws-ofi-rccl on top of the mpich-lustre container. I will try.

Ubuntu added 3 commits March 13, 2024 02:59
- added a readme to describe how to build the container
- update the recipe to add Lustre to ensure Lustre-aware MPI
- updated recipe to generalise it a bit, added comments to
@pelahi pelahi merged commit 0ce5753 into master Mar 13, 2024
Labels
enhancement New feature or request
4 participants