-
Couldn't load subscription status.
- Fork 427
[no-relnote] Add E2E for containerd #1313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds comprehensive E2E testing for containerd runtime configuration alongside existing Docker testing infrastructure. The changes introduce a nested container testing framework that allows running tests inside containers to validate NVIDIA Container Toolkit behavior in containerized environments.
- Adds new E2E tests for containerd drop-in configuration functionality
- Introduces nvidia-cdi-refresh systemd unit testing
- Implements nested container runner infrastructure for isolated testing
Reviewed Changes
Copilot reviewed 9 out of 32 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/go.mod | Adds new dependencies for UUID generation and test utilities |
| tests/e2e/runner.go | Implements nested container runner with Docker installation and CTK setup |
| tests/e2e/nvidia-ctk_containerd_test.go | New comprehensive containerd E2E test suite |
| tests/e2e/nvidia-ctk_docker_test.go | Refactors to use shared runner infrastructure and fixes macOS compatibility |
| tests/e2e/nvidia-cdi-refresh_test.go | New systemd unit tests for CDI refresh functionality |
| tests/e2e/nvidia-container-cli_test.go | Refactors to use nested container runner |
| tests/e2e/installer.go | Adds containerd installation template and additional flags support |
| tests/e2e/e2e_test.go | Centralizes test runner initialization in BeforeSuite |
| tests/e2e/Makefile | Documents new test categories |
Pull Request Test Coverage Report for Build 18005738357Details
💛 - Coveralls |
5c85056 to
3dc480c
Compare
|
I'll mark this PR as ready for review once #1235 is merged |
3dc480c to
563a192
Compare
Rebased |
029af03 to
1899001
Compare
51ad031 to
c65e468
Compare
|
Rebased |
| AfterAll(func(ctx context.Context) { | ||
| // Cleanup: remove the container and the temporary script on the host. | ||
| // Use || true to ensure cleanup doesn't fail the test | ||
| runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of the nolint let's just drop the return values.
| runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck | |
| _, _, _ = runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) |
Does it mak sense to at least WARN if the cleanup fails? The || true doesn't ensure that the test doesn't fail, the fact that we don't check the return value does that.
| # Remove any imports line from the config (reset to original state) | ||
| if [ -f /etc/containerd/config.toml ]; then | ||
| grep -v "^imports = " /etc/containerd/config.toml > /tmp/config.toml.tmp && mv /tmp/config.toml.tmp /etc/containerd/config.toml || true | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just make a copy of the original config and restore that after / before each test?
|
|
||
| # Restart containerd to pick up the clean config | ||
| systemctl restart containerd | ||
| sleep 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to check containerd health?
| output, _, err := nestedContainerRunner.Run(`cat /etc/containerd/conf.d/99-nvidia.toml`) | ||
| Expect(err).ToNot(HaveOccurred()) | ||
| Expect(output).To(ContainSubstring(`nvidia`)) | ||
| Expect(output).To(ContainSubstring(`nvidia-cdi`)) | ||
| Expect(output).To(ContainSubstring(`nvidia-legacy`)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in person, we are nolonger triggering the configuration of containerd with the current installation mechanism.
| output, _, err = nestedContainerRunner.Run(`containerd config dump`) | ||
| Expect(err).ToNot(HaveOccurred()) | ||
| // Verify imports section is in the merged config | ||
| Expect(output).To(ContainSubstring(`imports = ['/etc/containerd/conf.d/*.toml']`)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think that config dump prints that ACTUAL paths of all files processed.
| ContainSubstring(`default_runtime_name = "nvidia"`), | ||
| ContainSubstring(`default_runtime_name = 'nvidia'`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should definitely be VERSION specific checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having tests that pass in multiple states are prone to be flaky. Please pick the format for the version of containerd that we have installed by default and use that. This makes the differences between SPECIFIC containerd versions more obvious when reading the tests.
| ContainSubstring(`default_runtime_name = "nvidia"`), | ||
| ContainSubstring(`default_runtime_name = 'nvidia'`), | ||
| )) | ||
| Expect(output).To(ContainSubstring(`enable_cdi = true`)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we toggle this behaviour? It is disabled by default.
| ContainSubstring(`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`), | ||
| ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once again, thsi should be version-specific.
| }) | ||
|
|
||
| When("containerd already has a custom default runtime configured", func() { | ||
| It("should preserve the existing default runtime when --set-as-default=false is specified", func(ctx context.Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--set-as-default=false is not specified. It is the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we don't strictly speaking need a custom runtime to test the behaviour of NOT overriding the default.
| `) | ||
| Expect(err).ToNot(HaveOccurred()) | ||
|
|
||
| // Configure containerd with drop-in config (explicitly not setting as default) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we configuring with the drop-in config in this case?
| Expect(err).ToNot(HaveOccurred(), "Failed to reset containerd configuration") | ||
| }) | ||
|
|
||
| When("configuring containerd on a Kubernetes node", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We're not on a Kubernetes node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edited
| When("configuring containerd on a Kubernetes node", func() { | ||
| It("should add NVIDIA runtime using drop-in config without modifying the main config", func(ctx context.Context) { | ||
| // Configure containerd using nvidia-ctk | ||
| cmd := fmt.Sprintf(nvidiaCtkConfigureContainerdCmd, "--nvidia-set-as-default --cdi.enabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's definitely just use string concatenation here. I would even go so far as to say that we should just duplicate the command every time we want to run it.
| rm -rf /etc/containerd/config.d | ||
| mkdir -p /etc/containerd/config.d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we removing and creating this folder? Does this not contradict us restoring the backups in BeforeEach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, adopted
| SystemdCgroup = true | ||
| CustomOption = "custom-value" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are theses values relevant? Should we check that the applied config includes these settings for the nvidia runtime?
| // NVIDIA runtime should be added | ||
| Expect(output).To(ContainSubstring(`nvidia`)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very BROAD check.
| // Custom runtime should be preserved with all its options | ||
| Expect(output).To(MatchRegexp(`(?s)custom-runtime.*CustomOption.*custom-value`)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not what this checks. For example the SystemdCgroups setting is not checked.
ef6d766 to
07e70a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are still some unanswered questions below.
Also, the intent of this PR is to test how the merged configs are handled by different containerd versions. I don't feel as if this really achieve this since we're only ever testing the version of containerd that is included in the kind node image.
Can one parameterize the test for 1.7.x and 2.0 so that we can validate the behavior there? (I'm happy to do this as a follow-up, but it would be good to understand how we would handle the expected differences).
| # Backup the original conf.d directory | ||
| if [ -d /etc/containerd/conf.d ]; then | ||
| cp -r /etc/containerd/conf.d /tmp/containerd-conf.d.backup | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about /etc/containerd/config.toml (if it exists)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| Expect(err).ToNot(HaveOccurred(), "Failed to restart containerd after configuration restore") | ||
| }) | ||
|
|
||
| When("configuring containerd on a KIND node", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Why reference KIND at all? We only selected the kind nodes since this was the simplest way to get containerd running in a container.
| When("configuring containerd on a KIND node", func() { | |
| When("configuring containerd", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| ContainSubstring(`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`), | ||
| ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As called out a number of times, this check is too broad. We should capture the differences between different containerd versions explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we now check depending on the config version
| }) | ||
|
|
||
| When("configuring containerd on a KIND node", func() { | ||
| It("should add NVIDIA runtime using drop-in config without modifying the main config", func(ctx context.Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only true for containerd versions that INCLUDE the default /etc/containerd/conf.d imports.
| )) | ||
|
|
||
| // Verify the drop-in config was processed (config dump shows actual imported files) | ||
| Expect(output).To(ContainSubstring(`/etc/containerd/conf.d/*.toml`)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please load the output config as toml and check specific values instead of checking substrings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we now parse the TOML and run the checks from it
| // Verify KIND settings are still present | ||
| Expect(output).To(ContainSubstring(`SystemdCgroup = true`)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this a Kind setting? Does the kind node ALWAYS use systemd cgroups?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KIND comes with a preconfigured containerd config.toml file.
07e70a7 to
34346af
Compare
| # Add imports directive if not present | ||
| if ! grep -q "imports" /etc/containerd/config.toml; then | ||
| # Create imports line | ||
| cat > /tmp/imports.line <<EOF | ||
| imports = ["/etc/containerd/conf.d/*.toml"] | ||
| EOF | ||
| # Prepend to existing config | ||
| cat /etc/containerd/config.toml >> /tmp/imports.line | ||
| mv /tmp/imports.line /etc/containerd/config.toml | ||
| fi`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we doing this? The nvidia-ctk runtime configure command should do this in cases where the imports directive does not exist.
|
|
||
| // Verify config version | ||
| version := config.Get("version") | ||
| Expect(version).To(Equal(int64(env.configVersion))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Does gomega have an EqualValue matcher so that we don't have to worry about the type?
(Alternatively, just update the type of env.configVersion to be int64).
| // Verify imports | ||
| if env.configVersion == 2 { | ||
| // containerd 1.7 shows actual resolved imports | ||
| err = validateImports(config, []string{"/etc/containerd/conf.d/99-nvidia.toml"}, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a sideeffect of the config dump command and does not reflect the actual content of the config file. One question I do have is why we're using validateImports here and not for the non v2 case?
| Expect(err).ToNot(HaveOccurred(), "Failed to get plugin config") | ||
|
|
||
| // Verify CDI is enabled | ||
| cdiEnabled, err := getCDIEnabled(pluginConfig, env.configVersion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the version relevant here?
| containerdSection := pluginConfig.Get("containerd") | ||
| if containerdSection == nil { | ||
| return "", fmt.Errorf("containerd section not found") | ||
| } | ||
|
|
||
| containerdTree, ok := containerdSection.(*toml.Tree) | ||
| if !ok { | ||
| return "", fmt.Errorf("containerd section is not a TOML tree") | ||
| } | ||
|
|
||
| defaultRuntime := containerdTree.Get("default_runtime_name") | ||
| if defaultRuntime == nil { | ||
| return "", nil // No default runtime set | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we use GetPath? or Get("containerd.default_runtime_name")?
| } | ||
|
|
||
| // validateRuntimeConfig validates a specific runtime configuration | ||
| func validateRuntimeConfig(runtime interface{}, expectedType string, expectedOptions map[string]interface{}) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under which conditions do we ever call this with anything other than a toml.Tree?
| } | ||
|
|
||
| // Check runtime type | ||
| runtimeType, ok := runtimeMap["runtime_type"].(string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like something that we can check quite simply at the call site -- especially since we seem to mostly call this with runtimeType == "".
| // For empty expectedType, we don't check the runtime type | ||
| // For nvidia runtime, we accept various runtime types | ||
| if expectedType == "nvidia" { | ||
| validTypes := []string{"io.containerd.runc.v2", "io.containerd.nvidia.v1", "io.containerd.nvidia.v2"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is this ever io.containerd.nvidia.v1? Where does this list come from?
| if expectedType != "" { | ||
| // For empty expectedType, we don't check the runtime type | ||
| // For nvidia runtime, we accept various runtime types | ||
| if expectedType == "nvidia" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never seem to call this with expectedType == "nvidia" what is the purpose of this branch?
| return fmt.Errorf("options is not a map or toml.Tree, got %T", v) | ||
| } | ||
|
|
||
| for key, expectedValue := range expectedOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use a gomega ContainsElements or ConsistsOf instead?
34346af to
899fec9
Compare
| if !env.hasDefaultImports { | ||
| // For containerd 1.7.x, ensure a config file exists | ||
| // nvidia-ctk should add the imports directive automatically | ||
| _, _, err = nestedContainerRunner.Run(` | ||
| if [ ! -f /etc/containerd/config.toml ]; then | ||
| mkdir -p /etc/containerd | ||
| containerd config default > /etc/containerd/config.toml | ||
| fi`) | ||
| Expect(err).ToNot(HaveOccurred(), "Failed to create default containerd config") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain why this is even required? The nvidia-ctk will create this file if it does not exist. It really should not be needed to create this before the tests.
| // Ensure containerd is running | ||
| _, _, err = nestedContainerRunner.Run(` | ||
| # Start containerd if not running | ||
| if ! systemctl is-active --quiet containerd; then | ||
| systemctl start containerd | ||
| sleep 2 | ||
| fi | ||
| `) | ||
| Expect(err).ToNot(HaveOccurred(), "Failed to ensure containerd is running") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this IN addition to restartContainerdAndWait? Is there a reason that we can't just call that here if required?
| if [ -f /tmp/containerd-config.toml.backup ]; then | ||
| cp /tmp/containerd-config.toml.backup /etc/containerd/config.toml | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we REMOVE /etc/containerd/config.toml if /tmp/containerd-config.toml.backup does not exist?
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
899fec9 to
f82172f
Compare
No description provided.