Conversation

@ArangoGutierrez
Collaborator

No description provided.

Copilot AI left a comment

Pull Request Overview

This PR adds comprehensive E2E testing for containerd runtime configuration alongside existing Docker testing infrastructure. The changes introduce a nested container testing framework that allows running tests inside containers to validate NVIDIA Container Toolkit behavior in containerized environments.

  • Adds new E2E tests for containerd drop-in configuration functionality
  • Introduces nvidia-cdi-refresh systemd unit testing
  • Implements nested container runner infrastructure for isolated testing

Reviewed Changes

Copilot reviewed 9 out of 32 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/go.mod | Adds new dependencies for UUID generation and test utilities |
| tests/e2e/runner.go | Implements nested container runner with Docker installation and CTK setup |
| tests/e2e/nvidia-ctk_containerd_test.go | New comprehensive containerd E2E test suite |
| tests/e2e/nvidia-ctk_docker_test.go | Refactors to use shared runner infrastructure and fixes macOS compatibility |
| tests/e2e/nvidia-cdi-refresh_test.go | New systemd unit tests for CDI refresh functionality |
| tests/e2e/nvidia-container-cli_test.go | Refactors to use nested container runner |
| tests/e2e/installer.go | Adds containerd installation template and additional flags support |
| tests/e2e/e2e_test.go | Centralizes test runner initialization in BeforeSuite |
| tests/e2e/Makefile | Documents new test categories |

@ArangoGutierrez
Collaborator Author

Builds on #1235

Doesn't include #1311; tests for that should be added as a follow-up.

@coveralls

coveralls commented Sep 23, 2025

Pull Request Test Coverage Report for Build 18005738357

Details

  • 0 of 1 (0.0%) changed or added relevant line in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.006%) to 36.277%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/config/engine/containerd/config_drop_in.go | 0 | 1 | 0.0% |
Totals Coverage Status
Change from base Build 17981864462: 0.006%
Covered Lines: 4827
Relevant Lines: 13306

💛 - Coveralls

@ArangoGutierrez
Collaborator Author

I'll mark this PR as ready for review once #1235 is merged

@ArangoGutierrez
Collaborator Author

I'll mark this PR as ready for review once #1235 is merged

Rebased

@ArangoGutierrez ArangoGutierrez force-pushed the e2e_containerd branch 3 times, most recently from 029af03 to 1899001 Compare September 25, 2025 11:16
@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review September 25, 2025 11:26
@elezar elezar marked this pull request as draft October 13, 2025 12:10
@ArangoGutierrez ArangoGutierrez force-pushed the e2e_containerd branch 2 times, most recently from 51ad031 to c65e468 Compare October 13, 2025 13:57
@ArangoGutierrez
Collaborator Author

Rebased

@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review October 13, 2025 14:01
AfterAll(func(ctx context.Context) {
// Cleanup: remove the container and the temporary script on the host.
// Use || true to ensure cleanup doesn't fail the test
runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck
Member

Instead of the nolint let's just drop the return values.

Suggested change
runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName)) //nolint:errcheck
_, _, _ = runner.Run(fmt.Sprintf("docker rm -f %s 2>/dev/null || true", containerName))

Does it make sense to at least WARN if the cleanup fails? The || true doesn't ensure that the test doesn't fail; the fact that we don't check the return value does that.

Comment on lines 56 to 59
# Remove any imports line from the config (reset to original state)
if [ -f /etc/containerd/config.toml ]; then
grep -v "^imports = " /etc/containerd/config.toml > /tmp/config.toml.tmp && mv /tmp/config.toml.tmp /etc/containerd/config.toml || true
fi
Member

Why not just make a copy of the original config and restore that after / before each test?


# Restart containerd to pick up the clean config
systemctl restart containerd
sleep 2
Member

Is there a way to check containerd health?

Comment on lines 80 to 84
output, _, err := nestedContainerRunner.Run(`cat /etc/containerd/conf.d/99-nvidia.toml`)
Expect(err).ToNot(HaveOccurred())
Expect(output).To(ContainSubstring(`nvidia`))
Expect(output).To(ContainSubstring(`nvidia-cdi`))
Expect(output).To(ContainSubstring(`nvidia-legacy`))
Member

As I mentioned in person, we are no longer triggering the configuration of containerd with the current installation mechanism.

output, _, err = nestedContainerRunner.Run(`containerd config dump`)
Expect(err).ToNot(HaveOccurred())
// Verify imports section is in the merged config
Expect(output).To(ContainSubstring(`imports = ['/etc/containerd/conf.d/*.toml']`))
Member

I actually think that config dump prints the ACTUAL paths of all files processed.

Comment on lines 97 to 137
ContainSubstring(`default_runtime_name = "nvidia"`),
ContainSubstring(`default_runtime_name = 'nvidia'`),
Member

These should definitely be VERSION specific checks.

Member

Tests that pass in multiple states are prone to flakiness. Please pick the format for the version of containerd that we have installed by default and use that. This makes the differences between SPECIFIC containerd versions more obvious when reading the tests.

ContainSubstring(`default_runtime_name = "nvidia"`),
ContainSubstring(`default_runtime_name = 'nvidia'`),
))
Expect(output).To(ContainSubstring(`enable_cdi = true`))
Member

Where do we toggle this behaviour? It is disabled by default.

Comment on lines 105 to 145
ContainSubstring(`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`),
ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`),
Member

Once again, this should be version-specific.

})

When("containerd already has a custom default runtime configured", func() {
It("should preserve the existing default runtime when --set-as-default=false is specified", func(ctx context.Context) {
Member

--set-as-default=false is not specified. It is the default.

Member

Note that we don't strictly speaking need a custom runtime to test the behaviour of NOT overriding the default.

`)
Expect(err).ToNot(HaveOccurred())

// Configure containerd with drop-in config (explicitly not setting as default)
Member

How are we configuring with the drop-in config in this case?

Expect(err).ToNot(HaveOccurred(), "Failed to reset containerd configuration")
})

When("configuring containerd on a Kubernetes node", func() {
Member

nit: We're not on a Kubernetes node.

Collaborator Author

edited

When("configuring containerd on a Kubernetes node", func() {
It("should add NVIDIA runtime using drop-in config without modifying the main config", func(ctx context.Context) {
// Configure containerd using nvidia-ctk
cmd := fmt.Sprintf(nvidiaCtkConfigureContainerdCmd, "--nvidia-set-as-default --cdi.enabled")
Member

Let's definitely just use string concatenation here. I would even go so far as to say that we should just duplicate the command every time we want to run it.

Comment on lines 206 to 207
rm -rf /etc/containerd/config.d
mkdir -p /etc/containerd/config.d
Member

Why are we removing and creating this folder? Does this not contradict us restoring the backups in BeforeEach?

Collaborator Author

Agreed, adopted

Comment on lines 223 to 170
SystemdCgroup = true
CustomOption = "custom-value"
Member

Are these values relevant? Should we check that the applied config includes these settings for the nvidia runtime?

Comment on lines 260 to 261
// NVIDIA runtime should be added
Expect(output).To(ContainSubstring(`nvidia`))
Member

This is a very BROAD check.

Comment on lines 258 to 259
// Custom runtime should be preserved with all its options
Expect(output).To(MatchRegexp(`(?s)custom-runtime.*CustomOption.*custom-value`))
Member

That's not what this checks. For example, the SystemdCgroup setting is not checked.

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Member

@elezar elezar left a comment

There are still some unanswered questions below.

Also, the intent of this PR is to test how the merged configs are handled by different containerd versions. I don't feel as if this really achieves that, since we're only ever testing the version of containerd that is included in the kind node image.

Can one parameterize the test for 1.7.x and 2.0 so that we can validate the behavior there? (I'm happy to do this as a follow-up, but it would be good to understand how we would handle the expected differences).
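A possible shape for that parameterization is a small table of per-version expectations that each test case reads from. The sketch below is an assumption, not the PR's implementation: the `containerdEnv` type and its field names are hypothetical, the plugin keys are the two quoted in the checks above, and the pairing of config version 2 with 1.7.x and version 3 with 2.0 should be confirmed against the images actually used.

```go
package main

import "fmt"

// containerdEnv describes what a test expects from one containerd release line.
// Type and field names are hypothetical.
type containerdEnv struct {
	name          string
	configVersion int64
	criPluginKey  string
}

// containerdEnvs lists the release lines to test. The version pairing is an
// assumption to verify against the node images.
var containerdEnvs = []containerdEnv{
	{name: "containerd 1.7.x", configVersion: 2, criPluginKey: "io.containerd.grpc.v1.cri"},
	{name: "containerd 2.0", configVersion: 3, criPluginKey: "io.containerd.cri.v1.runtime"},
}

// nvidiaRuntimePath returns the TOML key path of the nvidia runtime table
// for the given environment.
func nvidiaRuntimePath(env containerdEnv) []string {
	return []string{"plugins", env.criPluginKey, "containerd", "runtimes", "nvidia"}
}

func main() {
	for _, env := range containerdEnvs {
		fmt.Println(env.name, env.configVersion, nvidiaRuntimePath(env))
	}
}
```

With Ginkgo, each element could become an `Entry` in a `DescribeTable`, which keeps one expected value per version instead of an assertion that accepts either format.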

Comment on lines +68 to +135
# Backup the original conf.d directory
if [ -d /etc/containerd/conf.d ]; then
cp -r /etc/containerd/conf.d /tmp/containerd-conf.d.backup
fi
Member

What about /etc/containerd/config.toml (if it exists)?

Collaborator Author

done

Expect(err).ToNot(HaveOccurred(), "Failed to restart containerd after configuration restore")
})

When("configuring containerd on a KIND node", func() {
Member

nit: Why reference KIND at all? We only selected the kind nodes since this was the simplest way to get containerd running in a container.

Suggested change
When("configuring containerd on a KIND node", func() {
When("configuring containerd", func() {

Collaborator Author

done

Comment on lines 128 to 129
ContainSubstring(`[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]`),
ContainSubstring(`[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]`),
Member

As called out a number of times, this check is too broad. We should capture the differences between different containerd versions explicitly.

Collaborator Author

we now check depending on the config version

})

When("configuring containerd on a KIND node", func() {
It("should add NVIDIA runtime using drop-in config without modifying the main config", func(ctx context.Context) {
Member

This is only true for containerd versions that INCLUDE the default /etc/containerd/conf.d imports.

))

// Verify the drop-in config was processed (config dump shows actual imported files)
Expect(output).To(ContainSubstring(`/etc/containerd/conf.d/*.toml`))
Member

Can we please load the output config as TOML and check specific values instead of checking substrings?

Collaborator Author

we now parse the TOML and run the checks from it
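To illustrate why structured checks are more precise than substring matches, the sketch below looks up one value by key path in an already-decoded config. It assumes the TOML has been decoded into nested `map[string]interface{}` values (the decoding step and library are left out), and `lookup` is a hypothetical helper rather than anything from the PR.

```go
package main

import "fmt"

// lookup walks nested string-keyed maps along path and reports whether a
// value was found. Each path element is one key, even if it contains dots.
func lookup(config map[string]interface{}, path ...string) (interface{}, bool) {
	var current interface{} = config
	for _, key := range path {
		table, ok := current.(map[string]interface{})
		if !ok {
			return nil, false
		}
		current, ok = table[key]
		if !ok {
			return nil, false
		}
	}
	return current, true
}

func main() {
	// Hand-built stand-in for a decoded `containerd config dump`.
	config := map[string]interface{}{
		"version": int64(2),
		"plugins": map[string]interface{}{
			"io.containerd.grpc.v1.cri": map[string]interface{}{
				"containerd": map[string]interface{}{
					"default_runtime_name": "nvidia",
				},
			},
		},
	}
	name, ok := lookup(config,
		"plugins", "io.containerd.grpc.v1.cri", "containerd", "default_runtime_name")
	fmt.Println(name, ok)
}
```

Passing the path as separate elements matters here because the plugin key itself contains dots, so a single dotted string would be ambiguous.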

Comment on lines 140 to 141
// Verify KIND settings are still present
Expect(output).To(ContainSubstring(`SystemdCgroup = true`))
Member

How is this a Kind setting? Does the kind node ALWAYS use systemd cgroups?

Collaborator Author

KIND comes with a preconfigured containerd config.toml file.

Comment on lines 144 to 153
# Add imports directive if not present
if ! grep -q "imports" /etc/containerd/config.toml; then
# Create imports line
cat > /tmp/imports.line <<EOF
imports = ["/etc/containerd/conf.d/*.toml"]
EOF
# Prepend to existing config
cat /etc/containerd/config.toml >> /tmp/imports.line
mv /tmp/imports.line /etc/containerd/config.toml
fi`)
Member

Why are we doing this? The nvidia-ctk runtime configure command should do this in cases where the imports directive does not exist.


// Verify config version
version := config.Get("version")
Expect(version).To(Equal(int64(env.configVersion)))
Member

@elezar elezar Oct 22, 2025

Question: Does gomega have an EqualValue matcher so that we don't have to worry about the type?

(Alternatively, just update the type of env.configVersion to be int64).

// Verify imports
if env.configVersion == 2 {
// containerd 1.7 shows actual resolved imports
err = validateImports(config, []string{"/etc/containerd/conf.d/99-nvidia.toml"}, true)
Member

This is a side effect of the config dump command and does not reflect the actual content of the config file. One question I do have is why we're using validateImports here and not for the non-v2 case?

Expect(err).ToNot(HaveOccurred(), "Failed to get plugin config")

// Verify CDI is enabled
cdiEnabled, err := getCDIEnabled(pluginConfig, env.configVersion)
Member

Why is the version relevant here?

Comment on lines 657 to 670
containerdSection := pluginConfig.Get("containerd")
if containerdSection == nil {
return "", fmt.Errorf("containerd section not found")
}

containerdTree, ok := containerdSection.(*toml.Tree)
if !ok {
return "", fmt.Errorf("containerd section is not a TOML tree")
}

defaultRuntime := containerdTree.Get("default_runtime_name")
if defaultRuntime == nil {
return "", nil // No default runtime set
}
Member

Why don't we use GetPath? or Get("containerd.default_runtime_name")?

}

// validateRuntimeConfig validates a specific runtime configuration
func validateRuntimeConfig(runtime interface{}, expectedType string, expectedOptions map[string]interface{}) error {
Member

Under which conditions do we ever call this with anything other than a toml.Tree?

}

// Check runtime type
runtimeType, ok := runtimeMap["runtime_type"].(string)
Member

This seems like something that we can check quite simply at the call site -- especially since we seem to mostly call this with runtimeType == "".

// For empty expectedType, we don't check the runtime type
// For nvidia runtime, we accept various runtime types
if expectedType == "nvidia" {
validTypes := []string{"io.containerd.runc.v2", "io.containerd.nvidia.v1", "io.containerd.nvidia.v2"}
Member

When is this ever io.containerd.nvidia.v1? Where does this list come from?

if expectedType != "" {
// For empty expectedType, we don't check the runtime type
// For nvidia runtime, we accept various runtime types
if expectedType == "nvidia" {
Member

We never seem to call this with expectedType == "nvidia". What is the purpose of this branch?

return fmt.Errorf("options is not a map or toml.Tree, got %T", v)
}

for key, expectedValue := range expectedOptions {
Member

Could we use a gomega ContainElements or ConsistOf instead?

Comment on lines 136 to 145
if !env.hasDefaultImports {
// For containerd 1.7.x, ensure a config file exists
// nvidia-ctk should add the imports directive automatically
_, _, err = nestedContainerRunner.Run(`
if [ ! -f /etc/containerd/config.toml ]; then
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
fi`)
Expect(err).ToNot(HaveOccurred(), "Failed to create default containerd config")
}
Member

Could you please explain why this is even required? The nvidia-ctk will create this file if it does not exist. It really should not be necessary to create this before the tests.

Comment on lines 125 to 146
// Ensure containerd is running
_, _, err = nestedContainerRunner.Run(`
# Start containerd if not running
if ! systemctl is-active --quiet containerd; then
systemctl start containerd
sleep 2
fi
`)
Expect(err).ToNot(HaveOccurred(), "Failed to ensure containerd is running")
Member

Why do we need this IN addition to restartContainerdAndWait? Is there a reason that we can't just call that here if required?

Comment on lines 178 to 184
if [ -f /tmp/containerd-config.toml.backup ]; then
cp /tmp/containerd-config.toml.backup /etc/containerd/config.toml
fi
Member

Should we REMOVE /etc/containerd/config.toml if /tmp/containerd-config.toml.backup does not exist?

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
3 participants