-
Notifications
You must be signed in to change notification settings - Fork 349
Add nvidia-cdi-refresh service #1076
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces an NVIDIA CDI refresh service to automatically update the NVIDIA CDI specification when system events occur, alongside packaging and integration changes for Debian, RPM, and Docker build processes.
- Added new systemd unit files and udev rules to trigger the CDI refresh service.
- Updated packaging scripts for both Debian and RPM distributions and modified Dockerfiles to include new deployment artifacts.
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
packaging/rpm/SPECS/nvidia-container-toolkit.spec | Added Source entries and installation steps for the new service components. |
packaging/debian/nvidia-container-toolkit-cdi-refresh.postinst | Introduced post-installation steps for Debian packaging. |
packaging/debian/nvidia-container-toolkit-cdi-refresh.install | Listed new deployment files for Debian. |
packaging/debian/control | Added a new control entry for the CDI refresh service package. |
docker/Dockerfile.* | Updated Dockerfiles to copy new systemd and udev files. |
deployments/udev/60-nvidia-cdi-refresh.rules | New udev rules to trigger service on NVIDIA events. |
deployments/systemd/nvidia-cdi-refresh.service | Defined the one-shot systemd service to refresh the CDI spec. |
deployments/systemd/nvidia-cdi-refresh.path | Defined the systemd path unit to monitor module changes. |
Comments suppressed due to low confidence (1)
packaging/debian/control:33
- The package name 'nvidia-container-toolkit-cdi-refresh service' contains a space which might cause issues; consider renaming it to 'nvidia-container-toolkit-cdi-refresh-service' for consistency.
Package: nvidia-container-toolkit-cdi-refresh service
09aee84
to
c366578
Compare
# limitations under the License. | ||
|
||
[Unit] | ||
Description=Refresh NVIDIA OCI CDI specification file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description=Refresh NVIDIA OCI CDI specification file | |
Description=Refresh NVIDIA CDI specification file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
[Service] | ||
Type=oneshot | ||
# The 30-second delay ensures that dependent services or resources are fully initialized. | ||
# If the rationale for this delay is unclear, consider evaluating whether a shorter delay is sufficient. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# If the rationale for this delay is unclear, consider evaluating whether a shorter delay is sufficient. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
d2bb23d
to
077bd25
Compare
[Service] | ||
Type=oneshot | ||
# The 30-second delay ensures that dependent services or resources are fully initialized. | ||
ExecStartPre=/bin/sleep 30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that we have a 30 second sleep after EVERY event?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the only event that actually needs this is install/uninstall.
During an apt install cuda-drivers
, it can take from 10 to 30 seconds from the first line of this logs to the last line
depmod....
Setting up libnvidia-decode-575:amd64 (575.51.03-0ubuntu1) ...
Setting up libnvidia-compute-575:amd64 (575.51.03-0ubuntu1) ...
Setting up libnvidia-encode-575:amd64 (575.51.03-0ubuntu1) ...
Setting up nvidia-utils-575 (575.51.03-0ubuntu1) ...
Setting up nvidia-compute-utils-575 (575.51.03-0ubuntu1) ...
Setting up libnvidia-gl-575:amd64 (575.51.03-0ubuntu1) ...
Setting up nvidia-driver-575 (575.51.03-0ubuntu1) ...
Setting up cuda-drivers-575 (575.51.03-0ubuntu1) ...
Setting up cuda-drivers (575.51.03-0ubuntu1) ...
Processing triggers for mailcap (3.70+nmu1ubuntu1) ...
Processing triggers for desktop-file-utils (0.26-1ubuntu3) ...
Processing triggers for initramfs-tools (0.140ubuntu13.4) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-136-generic
W: Possible missing firmware /lib/firmware/ast_dp501_fw.bin for module ast
Processing triggers for gnome-menus (3.36.0-1ubuntu3) ...
Processing triggers for libc-bin (2.35-0ubuntu3.9) ...
Processing triggers for man-db (2.10.2-1) ...
Processing triggers for dbus (1.12.20-2ubuntu4.1) ...
Scanning processes...
Scanning processor microcode...
Scanning linux images...
Running kernel seems to be up-to-date.
The processor microcode seems to be up-to-date.
No services need to be restarted.
No containers need to be restarted.
No user sessions are running outdated binaries.
No VM guests are running outdated hypervisor (qemu) binaries on this host.
so 30 seconds is a safe time to wait for the full DEB install to happen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we not have an additional package on rpm-based systems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean adding the service install as part of the regular RPM install script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the debian packages you added an nvidia-container-toolkit-cdi-refresh
package that includes the systemd unit and udef rules. This seems to be missing from the RPM packages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your question, on the DEB side I noticed we have separated specific components into individual packages.
But on the RPM side we only have 1 file (RPM Package def file) so I followed the structure and added the install of the new 3 files in the same RPM def file
packaging
├── debian
│ ├── changelog.old
│ ├── compat
│ ├── control
│ ├── copyright
│ ├── nvidia-container-toolkit-base.install
│ ├── nvidia-container-toolkit-base.postinst
│ ├── nvidia-container-toolkit-cdi-refresh.install
│ ├── nvidia-container-toolkit-cdi-refresh.postinst
│ ├── nvidia-container-toolkit-operator-extensions.install
│ ├── nvidia-container-toolkit.install
│ ├── nvidia-container-toolkit.lintian-overrides
│ ├── nvidia-container-toolkit.postinst
│ ├── nvidia-container-toolkit.postrm
│ ├── prepare
│ └── rules
└── rpm
├── SOURCES
│ └── LICENSE
└── SPECS
└── nvidia-container-toolkit.spec
Type=oneshot | ||
# The 30-second delay ensures that dependent services or resources are fully initialized. | ||
ExecStartPre=/bin/sleep 30 | ||
ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Should this path depend on the installation locations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, yes. umm let me think how to mod this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm after checking we /usr/bin
set as default for both deb/rpm, and we don't provide macros to change the install path. So having this hardcoded here looks ok. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then install command installs to:
install -m 755 -t %{buildroot}%{_bindir} nvidia-ctk
If this is always /usr/bin
then we don't neet to update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the documentation, we can leave it as is. - https://docs.fedoraproject.org/en-US/packaging-guidelines/RPMMacros/#macros_installation
077bd25
to
29901e3
Compare
a641d38
to
d150c66
Compare
d8b0cb2
to
5ffb29b
Compare
packaging/debian/control
Outdated
|
||
Package: nvidia-container-toolkit-cdi-refresh | ||
Architecture: any | ||
Depends: ${misc:Depends}, nvidia-container-toolkit-base (= @VERSION@) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means that this will not be installed by default. Should we just include the Systemd units in the -base
package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, not by default. Maybe yes, it makes sense to move to the base pkg.
%config /etc/systemd/system/nvidia-cdi-refresh.service | ||
%config /etc/systemd/system/nvidia-cdi-refresh.path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does defining these as configs mean that a remove doesn't remove them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm yes, I was following what I saw on the mig-parted install script, but I guess here we want a different behaviour
packaging/debian/control
Outdated
@@ -29,3 +29,9 @@ Architecture: any | |||
Depends: ${misc:Depends}, nvidia-container-toolkit-base (= @VERSION@) | |||
Description: NVIDIA Container Toolkit Operator Extensions | |||
Provides tools for using the NVIDIA Container Toolkit with the GPU Operator | |||
|
|||
Package: nvidia-container-toolkit-cdi-refresh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I called out in a previous comment, we don't have the same package in the RPM package spec. Could we discuss why these are not just part of the -base
package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah for consistency against RPM I have merged this into the base
pkg
|
||
[Unit] | ||
Description=Refresh NVIDIA CDI specification file | ||
ConditionPathExists=/usr/bin/nvidia-smi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also check for /usr/bin/nvidia-ctk
?
5ffb29b
to
e1ab7af
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a new systemd service to refresh the NVIDIA CDI specification on driver installation events and updates packaging and Docker build files accordingly.
- Introduces nvidia-cdi-refresh.service and nvidia-cdi-refresh.path for automatically triggering CDI refresh.
- Updates RPM and Debian packaging scripts to install and enable the new service.
- Modifies Dockerfiles for multiple distributions to include the new systemd files.
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
packaging/rpm/SPECS/nvidia-container-toolkit.spec | Adds systemd unit sources and installation steps for RPM. |
packaging/debian/nvidia-container-toolkit-base.postinst | Enables the new path unit during post-installation. |
packaging/debian/nvidia-container-toolkit-base.install | Installs the new systemd service and path files. |
docker/Dockerfile.ubuntu | Copies systemd files into the Ubuntu Docker build. |
docker/Dockerfile.rpm-yum | Copies systemd files into the RPM Docker build. |
docker/Dockerfile.opensuse-leap | Copies systemd files into the openSUSE Docker build. |
docker/Dockerfile.debian | Copies systemd files into the Debian Docker build. |
deployments/systemd/nvidia-cdi-refresh.service | Defines the new systemd service for CDI refresh. |
deployments/systemd/nvidia-cdi-refresh.path | Defines the path unit to trigger the CDI refresh service. |
e1ab7af
to
ae30b74
Compare
Automatic regeneration of /var/run/cdi/nvidia.yaml New units: • nvidia-cdi-refresh.service – one-shot wrapper for nvidia-ctk cdi generate (adds sleep + required caps). • nvidia-cdi-refresh.path – fires on driver install/upgrade via modules.dep.bin changes. Packaging • RPM %post reloads systemd and enables the path unit on fresh installs. • DEB postinst does the same (configure, skip on upgrade). Result: CDI spec is always up to date Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
ae30b74
to
fb58ffc
Compare
This pull request introduces a new systemd-based service to refresh the NVIDIA Container Device Interface (CDI) specification file upon NVIDIA driver installation or uninstallation. It also updates packaging scripts for Debian and RPM-based distributions to include and configure this service. Below are the most significant changes grouped by theme:
This systemd service covers static package operations:
does not handle:
New Systemd Service for CDI Refresh:
nvidia-cdi-refresh.service
to execute the CDI refresh usingnvidia-ctk
andnvidia-cdi-refresh.path
to monitor driver installation/uninstallation events. These files ensure the service is triggered when driver-related files change. (deployments/systemd/nvidia-cdi-refresh.service
,deployments/systemd/nvidia-cdi-refresh.path
) [1] [2]Updates to Dockerfiles:
debian
,ubuntu
,opensuse-leap
,rpm-yum
) to include the new systemd service and related files in the build process. (docker/Dockerfile.debian
,docker/Dockerfile.ubuntu
,docker/Dockerfile.opensuse-leap
,docker/Dockerfile.rpm-yum
) [1] [2] [3] [4]Debian Packaging Changes:
nvidia-container-toolkit-cdi-refresh
to install the systemd service and path files, along with a post-installation script to enable and start the service on installation. (packaging/debian/control
,packaging/debian/nvidia-container-toolkit-cdi-refresh.install
,packaging/debian/nvidia-container-toolkit-cdi-refresh.postinst
) [1] [2] [3]RPM Packaging Changes:
packaging/rpm/SPECS/nvidia-container-toolkit.spec
) [1] [2] [3]