Fix behavior for present but failing nvidiasmi #910

Conversation

casparvl
Collaborator

We have some CPU nodes on which the nvidia-smi command is installed but fails:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

In other words, simply checking for the presence of nvidia-smi and then concluding that GPUs are available (as we did prior to this PR) is not very robust. Since nvidia-smi does return a non-zero exit code in this case, the check is easy to make more robust, which is what this PR does.
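The difference between the two checks can be sketched in shell roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the `SMI_CMD` variable and the `presence_check` / `exit_code_check` names are hypothetical, and `--version` is used only because the commit messages mention it as an example invocation.

```shell
#!/usr/bin/env bash
# SMI_CMD is a hypothetical override used for illustration; it defaults to nvidia-smi.
SMI_CMD="${SMI_CMD:-nvidia-smi}"

# Presence-only check: finding the binary on PATH is taken as proof of GPUs.
if command -v "$SMI_CMD" > /dev/null 2>&1; then
    presence_check="yes"
else
    presence_check="no"
fi

# More robust check: the command must also run and exit with status 0.
# On a CPU node where it cannot talk to the driver, it exits non-zero.
if command -v "$SMI_CMD" > /dev/null 2>&1 && "$SMI_CMD" --version > /dev/null 2>&1; then
    exit_code_check="yes"
else
    exit_code_check="no"
fi

echo "presence only: ${presence_check}; presence + exit code: ${exit_code_check}"
```

On a node like the one described above, the first check would report "yes" while the second reports "no".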

Caspar van Leeuwen added 3 commits February 10, 2025 19:00
…. The command will exist, but return a non-zero exit code when run with e.g. --version, because there are no GPU drivers

eessi-bot bot commented Feb 10, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphire_rapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat


eessi-bot bot commented Feb 10, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

@eessi-bot-casparvl

Instance eessi-bot-casparvl is configured to build for:

  • architectures: x86_64/amd/zen4, x86_64/amd/zen2
  • repositories: eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software

@gpu-bot-ugent

gpu-bot-ugent bot commented Feb 10, 2025

Instance eessi-bot-vsc-ugent is configured to build for:

  • architectures: x86_64/amd/zen3
  • repositories: eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi-hpc.org-2023.06-software, eessi.io-2023.06-compat

Collaborator

@laraPPr laraPPr left a comment

lgtm

@laraPPr laraPPr merged commit f3fe933 into EESSI:2023.06-software.eessi.io Feb 13, 2025
50 checks passed

eessi-bot bot commented Feb 13, 2025

PR merged! Moved [] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.02.13

@riscv-eessi-io-bot

PR merged! Moved [] to /home/eessibot/shared/trash_bin/EESSI/software-layer/2025.02.13


eessi-bot bot commented Feb 13, 2025

PR merged! Moved [] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.02.13

@gpu-bot-ugent

gpu-bot-ugent bot commented Feb 13, 2025

PR merged! Moved [] to /scratch/gent/vo/002/gvo00211/SHARED/trash_bin/EESSI/software-layer/2025.02.13

@boegel
Contributor

boegel commented Feb 13, 2025

Lots of copy-pasting here. Can we consider moving this check into a utility function so we don't have this duplicated code?
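One way such a utility function might look is sketched below. The function name `nvidia_gpu_available` and its optional command argument are hypothetical and do not come from the PR; the sketch only assumes that the check amounts to "the command exists and exits with status 0".

```shell
#!/usr/bin/env bash
# Hypothetical shared helper (name and signature are assumptions, not from the PR).
# Succeeds only if the command exists on PATH and exits with status 0.
nvidia_gpu_available() {
    # Optional argument lets callers (and tests) substitute another command.
    local smi_cmd="${1:-nvidia-smi}"
    command -v "$smi_cmd" > /dev/null 2>&1 && "$smi_cmd" --version > /dev/null 2>&1
}

# Each call site then reduces to a single condition instead of a copied block.
if nvidia_gpu_available; then
    echo "GPU support: enabled"
else
    echo "GPU support: disabled"
fi
```

Keeping the check in one place means any later change, such as a different test invocation, only has to be made once.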
