Add CUDA forward compatibility hook #948
This change adds an nvidia-cdi-hook enable-cuda-compat hook that checks the container for cuda compat libs and updates /etc/ld.so.conf.d to include their parent folder if their driver major version is sufficient. This allows CUDA Forward Compatibility to be used when this is not available through the libnvidia-container. Signed-off-by: Evan Lezar <[email protected]>
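The version check described in this commit message could look roughly like the sketch below. It is not the code from this PR: the helper names (majorVersion, compatLibVersion, compatIsNewer) are hypothetical, and it assumes both that the compat library version can be read from the libcuda.so.<version> file name and that "sufficient" means a strictly higher major version than the host driver.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
	"strings"
)

// majorVersion returns the leading numeric component of a driver version
// string such as "570.153.02".
func majorVersion(version string) (int, error) {
	major, _, _ := strings.Cut(version, ".")
	n, err := strconv.Atoi(major)
	if err != nil {
		return 0, fmt.Errorf("invalid driver version %q: %w", version, err)
	}
	return n, nil
}

// compatLibVersion extracts the version suffix from a compat library path.
// Assumption: the version is encoded in the file name as libcuda.so.<version>.
func compatLibVersion(libPath string) (string, error) {
	const prefix = "libcuda.so."
	name := path.Base(libPath)
	if !strings.HasPrefix(name, prefix) {
		return "", fmt.Errorf("unexpected compat library name %q", name)
	}
	return strings.TrimPrefix(name, prefix), nil
}

// compatIsNewer reports whether the compat library has a higher driver major
// version than the host driver. Assumption: "sufficient" means strictly newer.
func compatIsNewer(libPath, hostDriverVersion string) (bool, error) {
	compatVersion, err := compatLibVersion(libPath)
	if err != nil {
		return false, err
	}
	compatMajor, err := majorVersion(compatVersion)
	if err != nil {
		return false, err
	}
	hostMajor, err := majorVersion(hostDriverVersion)
	if err != nil {
		return false, err
	}
	return compatMajor > hostMajor, nil
}

func main() {
	// The compat library version below is a made-up value for illustration.
	newer, err := compatIsNewer("/usr/local/cuda/compat/libcuda.so.575.0.0", "570.153.02")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("compat libs newer than host driver:", newer)
}
```

Comparing only the major component keeps the check cheap and mirrors the "driver major version" wording above; a real implementation may need to handle additional naming schemes.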
This change adds the enable-cuda-compat hook to the incoming OCI runtime spec if the allow-cuda-compat-libs-from-container feature flag is not enabled. An update-ldcache hook is also injected to ensure that the required folders are processed. Signed-off-by: Evan Lezar <[email protected]>
Force-pushed from 3307cb1 to c1bac28.
		return []string{"nvidia-hook-remover", "feature-gated", "mode"}
	default:
		return []string{"mode", "graphics", "feature-gated"}
		return []string{"feature-gated", "graphics", "mode"}
@elezar Hi, I have a question here. With this modifier order, the createContainer hooks are generated in the order ["enable-cuda-compat", "update-ldcache", "create-symlinks"]. Since "update-ldcache" runs before "create-symlinks", won't any shared objects (dynamic link libraries) that "create-symlinks" bind mounts into the container be missing from the ldcache?
NVIDIA Container Toolkit 1.17.5 requires Go >= 1.22 [1], and starts
using enable-cuda-compat hooks in the Container Device Interface
specification generated by it [2]. For example:
    "hookName": "createContainer",
    "path": "/usr/bin/nvidia-cdi-hook",
    "args": [
      "nvidia-cdi-hook",
      "enable-cuda-compat",
      "--host-driver-version=570.153.02"
    ]
The new hook makes it possible to have containers with a
/usr/local/cuda/compat/libcuda.so.* that's newer than the proprietary
NVIDIA driver on the host operating system, so that applications can use
a newer CUDA without having to update the driver [3]. Even though this
sounds useful, the hook has been disabled until it's handled by the
'init-container' command and there's a clear way to test it.
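For illustration, a hook entry with the shape shown in the example above could be assembled and serialized as in the following sketch. The hook struct and the cudaCompatHook helper are hypothetical stand-ins written for this example; they only mirror the three JSON fields from the example and are not the types used by the NVIDIA Container Toolkit or the CDI library.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
)

// hook is a local stand-in for a CDI hook entry. It only mirrors the JSON
// fields shown in the example and is not the CDI library's own type.
type hook struct {
	HookName string   `json:"hookName"`
	Path     string   `json:"path"`
	Args     []string `json:"args"`
}

// cudaCompatHook is a hypothetical helper that builds a createContainer hook
// entry invoking the enable-cuda-compat subcommand of the given hook binary.
func cudaCompatHook(hookPath, hostDriverVersion string) hook {
	return hook{
		HookName: "createContainer",
		Path:     hookPath,
		Args: []string{
			path.Base(hookPath), // argv[0] is the binary name
			"enable-cuda-compat",
			"--host-driver-version=" + hostDriverVersion,
		},
	}
}

func main() {
	h := cudaCompatHook("/usr/bin/nvidia-cdi-hook", "570.153.02")
	out, err := json.MarshalIndent(h, "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

A consumer that wants to disable the hook, as described above, could filter entries whose second argument is enable-cuda-compat before applying the specification.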
The src/go.sum file was updated with 'go mod tidy'.
[1] NVIDIA Container Toolkit commit 5bdf14b1e7c24763
NVIDIA/nvidia-container-toolkit@5bdf14b1e7c24763
NVIDIA/nvidia-container-toolkit#941
NVIDIA/nvidia-container-toolkit#950
[2] NVIDIA Container Toolkit commit 76040ff2ad63fb82
NVIDIA/nvidia-container-toolkit@76040ff2ad63fb82
NVIDIA/nvidia-container-toolkit#906
NVIDIA/nvidia-container-toolkit#948
[3] https://docs.nvidia.com/deploy/cuda-compatibility/
containers#1662
With #877 the default behaviour of the NVIDIA Container Runtime / NVIDIA Container Runtime Hook was changed to not mount compat libraries from the container into the container. This removed "automatic" support for CUDA Forward compatibility.
This change attempts to address this by adding a createContainerHook that will create a file in /etc/ld.so.conf.d/ in the container to ensure that the /usr/local/cuda/compat libraries are added to the ldcache in preference to the libraries mounted from the host. The provided host driver version is compared to the version of the compat libraries in the container, and the config update is only performed if the compat libraries are newer than the host drivers.
Note that the hook only creates a file in the container's file system and does not perform any mount operations. This means that this mechanism does not present the same vulnerabilities that caused CVE-2024-0132 and CVE-2025-23359.
In the case of the legacy runtime, this behaviour is only triggered if the allow-cuda-compat-libs-from-container feature flag is not enabled. The CDI spec generation has also been extended to include this hook.
This backports #906
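The file-creation step described above could be sketched as follows. This is a minimal illustration, not the implementation in this PR: the config file name (00-compat-example.conf) and the helper names are hypothetical, and the container root is simulated with a temporary directory. The sketch only writes a regular file under the container root and performs no mounts, which is the property the description relies on.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// confFileName is a hypothetical file name chosen for this sketch.
const confFileName = "00-compat-example.conf"

// compatConfPath returns where the config file would live under the
// container's root filesystem.
func compatConfPath(containerRoot string) string {
	return filepath.Join(containerRoot, "etc", "ld.so.conf.d", confFileName)
}

// confContents returns the file body: ld.so.conf files list one library
// directory per line.
func confContents(compatDir string) string {
	return compatDir + "\n"
}

// writeCompatConf creates the config file inside the container root.
// compatDir is the path as seen from inside the container.
func writeCompatConf(containerRoot, compatDir string) (string, error) {
	target := compatConfPath(containerRoot)
	if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
		return "", fmt.Errorf("creating ld.so.conf.d: %w", err)
	}
	if err := os.WriteFile(target, []byte(confContents(compatDir)), 0o644); err != nil {
		return "", fmt.Errorf("writing %s: %w", target, err)
	}
	return target, nil
}

func main() {
	// A temporary directory stands in for the container root in this sketch.
	root, err := os.MkdirTemp("", "rootfs-")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(root)

	target, err := writeCompatConf(root, "/usr/local/cuda/compat")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("wrote", target)
}
```

The new directory only takes effect once the ldcache is regenerated inside the container, which is why the update-ldcache hook is injected alongside this one.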