refactor(ssi): introduce libraryinjection provider for APM injection #44972

iamluc · 2026-01-12T11:15:15Z

What does this PR do?

Introduces a new library_injection package with a provider-based architecture for APM library injection.
This refactoring extracts the injection logic into:

InjectAPMLibraries: a high-level function that orchestrates the injection flow
libraryInjectionProvider: an interface with InjectInjector and InjectLibrary methods
initContainerProvider: an implementation that uses init containers and EmptyDir volumes (current implementation)

The apmInjectionMutator function in namespace_mutator.go is simplified to delegate to libraryinjection.InjectAPMLibraries.
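
To make the split above concrete, here is a minimal sketch of how the three pieces could fit together. It is not the code from this PR: `Pod` is a simplified stand-in for `corev1.Pod`, the method parameters, the `LibraryConfig` fields, and the init container names (`init-injector`, `init-lib-<language>`) are all assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a simplified stand-in for corev1.Pod (assumption, keeps the sketch dependency-free).
type Pod struct {
	Volumes        []string
	InitContainers []string
}

// LibraryConfig is a hypothetical, trimmed-down library description.
type LibraryConfig struct {
	Language string
	Image    string
}

// libraryInjectionProvider has the two methods named in the description;
// the signatures are assumptions.
type libraryInjectionProvider interface {
	InjectInjector(pod *Pod, injectorImage string) error
	InjectLibrary(pod *Pod, lib LibraryConfig) error
}

// initContainerProvider models the init container + EmptyDir approach.
type initContainerProvider struct{}

func (initContainerProvider) InjectInjector(pod *Pod, injectorImage string) error {
	if injectorImage == "" {
		return errors.New("injector image is empty")
	}
	// Shared volume the init containers would copy files into.
	pod.Volumes = append(pod.Volumes, "datadog-auto-instrumentation")
	pod.InitContainers = append(pod.InitContainers, "init-injector") // hypothetical name
	return nil
}

func (initContainerProvider) InjectLibrary(pod *Pod, lib LibraryConfig) error {
	if lib.Image == "" {
		return fmt.Errorf("library image for %q is empty", lib.Language)
	}
	pod.InitContainers = append(pod.InitContainers, "init-lib-"+lib.Language) // hypothetical name
	return nil
}

// InjectAPMLibraries orchestrates the flow: injector first, then each library.
func InjectAPMLibraries(p libraryInjectionProvider, pod *Pod, injectorImage string, libs []LibraryConfig) error {
	if err := p.InjectInjector(pod, injectorImage); err != nil {
		return fmt.Errorf("injecting injector: %w", err)
	}
	for _, lib := range libs {
		if err := p.InjectLibrary(pod, lib); err != nil {
			return fmt.Errorf("injecting %s library: %w", lib.Language, err)
		}
	}
	return nil
}

func main() {
	pod := &Pod{}
	libs := []LibraryConfig{{Language: "java", Image: "example.com/java-lib:v1"}}
	if err := InjectAPMLibraries(initContainerProvider{}, pod, "example.com/injector:v1", libs); err != nil {
		fmt.Println("injection failed:", err)
		return
	}
	fmt.Println(pod.InitContainers)
}
```

Because the orchestrator only depends on the interface, a different provider (for example a CSI-based one) could be passed in without changing the calling mutator.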

Review Strategy

We recommend reviewing this PR commit by commit for easier understanding of the changes.

Commits Overview

refactor(ssi): introduce libraryinjection provider for APM injection
Core refactoring that introduces the new library_injection package with a provider-based architecture. This abstracts the injection mechanism behind an interface (libraryInjectionProvider) with an initial implementation using init containers (initContainerProvider).
Add tests
Adds unit tests for the library_injection package (pod_patcher_test.go, init_container_test.go) and additional integration tests in auto_instrumentation_test.go to ensure compatibility with the existing behavior.
Cleanup dead code
Removes legacy code that is no longer used after the refactoring: injector.go, lib_requirement.go, unused methods in language_versions.go, and various helper functions.
Move functions to their right locations
Improves code organization by moving functions to the files where they are actually used (e.g., getNamespaceLabels, extractLibrariesFromAnnotations → target_mutator.go).
Add release note
Adds the release note for the Cluster Agent changelog.

Motivation

This refactoring prepares the codebase for upcoming alternative injection modes, such as CSI driver-based injection. By abstracting the injection mechanism behind a provider interface, new injection strategies can be added without modifying the core mutation logic.

Describe how you validated your changes

All existing tests pass: go test -tags "kubeapiserver test" ./pkg/clusteragent/admission/mutate/autoinstrumentation/...
Added new tests to validate image configuration, annotation handling, and environment variable injection

Additional Notes

agent-platform-auto-pr · 2026-01-12T11:40:05Z

Go Package Import Differences

Baseline: 340919a
Comparison: 007ddd1

binary	os	arch	change
cluster-agent	linux	amd64	+1, -0
+github.com/DataDog/datadog-agent/pkg/clusteragent/admission/mutate/autoinstrumentation/library_injection
cluster-agent	linux	arm64	+1, -0
+github.com/DataDog/datadog-agent/pkg/clusteragent/admission/mutate/autoinstrumentation/library_injection

cit-pr-commenter · 2026-01-12T12:09:17Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 10653287-2120-4a4c-a047-4caa4c6e1c4b

Baseline: 340919a
Comparison: 007ddd1
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	docker_containers_cpu	% cpu utilization	+1.90	[-1.15, +4.95]	1	Logs

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	docker_containers_cpu	% cpu utilization	+1.90	[-1.15, +4.95]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	+0.78	[+0.73, +0.84]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	+0.39	[+0.36, +0.43]	1	Logs bounds checks dashboard
➖	ddot_metrics_sum_cumulativetodelta_exporter	memory utilization	+0.37	[+0.14, +0.60]	1	Logs
➖	quality_gate_idle	memory utilization	+0.24	[+0.20, +0.29]	1	Logs bounds checks dashboard
➖	quality_gate_metrics_logs	memory utilization	+0.17	[-0.04, +0.38]	1	Logs bounds checks dashboard
➖	docker_containers_memory	memory utilization	+0.13	[+0.05, +0.20]	1	Logs
➖	ddot_metrics	memory utilization	+0.11	[-0.12, +0.34]	1	Logs
➖	otlp_ingest_logs	memory utilization	+0.08	[-0.02, +0.17]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.07	[-0.30, +0.45]	1	Logs
➖	file_tree	memory utilization	+0.05	[+0.00, +0.10]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	+0.04	[-0.00, +0.09]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	+0.03	[-0.35, +0.42]	1	Logs
➖	uds_dogstatsd_to_api_v3	ingress throughput	+0.02	[-0.11, +0.14]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	+0.01	[-0.10, +0.13]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.08, +0.07]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	-0.03	[-0.43, +0.37]	1	Logs
➖	ddot_metrics_sum_cumulative	memory utilization	-0.14	[-0.30, +0.02]	1	Logs
➖	quality_gate_logs	% cpu utilization	-0.14	[-1.61, +1.32]	1	Logs bounds checks dashboard
➖	uds_dogstatsd_20mb_12k_contexts_20_senders	memory utilization	-0.28	[-0.33, -0.22]	1	Logs
➖	ddot_metrics_sum_delta	memory utilization	-0.30	[-0.50, -0.10]	1	Logs
➖	otlp_ingest_metrics	memory utilization	-0.49	[-0.65, -0.34]	1	Logs
➖	ddot_logs	memory utilization	-0.78	[-0.84, -0.72]	1	Logs

Bounds Checks: ✅ Passed

perf	experiment	bounds_check_name	replicates_passed	links
✅	docker_containers_cpu	simple_check_run	10/10
✅	docker_containers_memory	memory_usage	10/10
✅	docker_containers_memory	simple_check_run	10/10
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	lost_bytes	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_logs	lost_bytes	10/10	bounds checks dashboard
✅	quality_gate_logs	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_metrics_logs	cpu_usage	10/10	bounds checks dashboard
✅	quality_gate_metrics_logs	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_metrics_logs	lost_bytes	10/10	bounds checks dashboard
✅	quality_gate_metrics_logs	memory_usage	10/10	bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
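
The three criteria above can be read as a single predicate. The sketch below is one interpretation of them, not the Regression Detector's actual code: the `Experiment` struct, its field names, and `IsRegression` are hypothetical, and the 5.00% tolerance is taken from the report.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment holds the values reported per row (hypothetical struct).
type Experiment struct {
	Name         string
	DeltaMeanPct float64 // "Δ mean %"
	CILow        float64 // lower bound of "Δ mean % CI"
	CIHigh       float64 // upper bound of "Δ mean % CI"
	Erratic      bool    // true when the experiment is configured as erratic
}

// effectSizeTolerance is the |Δ mean %| threshold stated in the report.
const effectSizeTolerance = 5.0

// IsRegression applies the three stated criteria: the effect is large enough,
// the confidence interval excludes zero, and the experiment is not erratic.
func IsRegression(e Experiment) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= effectSizeTolerance
	excludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && excludesZero && !e.Erratic
}

func main() {
	rows := []Experiment{
		// Values copied from the tables above; docker_containers_cpu appears
		// under the ignored (erratic) experiments.
		{Name: "docker_containers_cpu", DeltaMeanPct: 1.90, CILow: -1.15, CIHigh: 4.95, Erratic: true},
		{Name: "tcp_syslog_to_blackhole", DeltaMeanPct: 0.78, CILow: 0.73, CIHigh: 0.84},
	}
	for _, r := range rows {
		fmt.Printf("%s regression=%v\n", r.Name, IsRegression(r))
	}
}
```

Note that tcp_syslog_to_blackhole has an interval that excludes zero, yet it is not flagged because its effect size is well under the 5.00% tolerance.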

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.

iamluc · 2026-01-12T13:20:07Z

pkg/clusteragent/admission/mutate/autoinstrumentation/library_injection/annotation.go

@betterengineering I think we should move mutate/autoinstrumentation/annotation.go to mutate/common/annotation.go so we can reuse it in library_injection.

But I didn't want to do it in this PR to keep the review simpler.

WDYT?


agent-platform-auto-pr · 2026-01-12T14:07:14Z

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 340919a
📊 Static Quality Gates Dashboard

Successful checks

Info

	Quality gate	Change	Size in MiB (prev → curr → max)
✅	docker_cluster_agent_amd64	-32.59 KiB (0.02% reduction)	180.761 → 180.729 → 181.080
✅	docker_cluster_agent_arm64	-64.61 KiB (0.03% reduction)	196.618 → 196.555 → 198.490

29 successful checks with minimal change (< 2 KiB)

	Quality gate	Current Size
✅	agent_deb_amd64	705.841 MiB
✅	agent_deb_amd64_fips	701.138 MiB
✅	agent_heroku_amd64	326.864 MiB
✅	agent_msi	571.327 MiB
✅	agent_rpm_amd64	705.828 MiB
✅	agent_rpm_amd64_fips	701.125 MiB
✅	agent_rpm_arm64	687.324 MiB
✅	agent_rpm_arm64_fips	683.465 MiB
✅	agent_suse_amd64	705.828 MiB
✅	agent_suse_amd64_fips	701.125 MiB
✅	agent_suse_arm64	687.324 MiB
✅	agent_suse_arm64_fips	683.465 MiB
✅	docker_agent_amd64	767.575 MiB
✅	docker_agent_arm64	773.692 MiB
✅	docker_agent_jmx_amd64	958.454 MiB
✅	docker_agent_jmx_arm64	953.290 MiB
✅	docker_cws_instrumentation_amd64	7.135 MiB
✅	docker_cws_instrumentation_arm64	6.689 MiB
✅	docker_dogstatsd_amd64	38.785 MiB
✅	docker_dogstatsd_arm64	37.128 MiB
✅	dogstatsd_deb_amd64	30.004 MiB
✅	dogstatsd_deb_arm64	28.156 MiB
✅	dogstatsd_rpm_amd64	30.004 MiB
✅	dogstatsd_suse_amd64	30.004 MiB
✅	iot_agent_deb_amd64	43.002 MiB
✅	iot_agent_deb_arm64	40.123 MiB
✅	iot_agent_deb_armhf	40.704 MiB
✅	iot_agent_rpm_amd64	43.003 MiB
✅	iot_agent_suse_amd64	43.003 MiB

On-wire sizes (compressed)

	Quality gate	Change	Size in MiB (prev → curr → max)
✅	agent_deb_amd64	-16.73 KiB (0.01% reduction)	173.537 → 173.521 → 174.490
✅	agent_deb_amd64_fips	-40.62 KiB (0.02% reduction)	172.422 → 172.383 → 173.750
✅	agent_heroku_amd64	+2.7 KiB (0.00% increase)	87.108 → 87.110 → 88.450
✅	agent_msi	+16.0 KiB (0.01% increase)	142.863 → 142.879 → 143.020
✅	agent_rpm_amd64	-45.27 KiB (0.03% reduction)	176.140 → 176.096 → 177.660
✅	agent_rpm_amd64_fips	-7.57 KiB (0.00% reduction)	175.349 → 175.342 → 176.600
✅	agent_rpm_arm64	+7.09 KiB (0.00% increase)	159.559 → 159.566 → 161.260
✅	agent_rpm_arm64_fips	+6.86 KiB (0.00% increase)	158.910 → 158.917 → 160.550
✅	agent_suse_amd64	-45.27 KiB (0.03% reduction)	176.140 → 176.096 → 177.660
✅	agent_suse_amd64_fips	-7.57 KiB (0.00% reduction)	175.349 → 175.342 → 176.600
✅	agent_suse_arm64	+7.09 KiB (0.00% increase)	159.559 → 159.566 → 161.260
✅	agent_suse_arm64_fips	+6.86 KiB (0.00% increase)	158.910 → 158.917 → 160.550
✅	docker_agent_amd64	+2.67 KiB (0.00% increase)	261.164 → 261.166 → 262.450
✅	docker_agent_arm64	neutral	250.219 MiB
✅	docker_agent_jmx_amd64	neutral	329.802 MiB
✅	docker_agent_jmx_arm64	neutral	314.837 MiB
✅	docker_cluster_agent_amd64	+5.59 KiB (0.01% increase)	63.849 → 63.855 → 64.490
✅	docker_cluster_agent_arm64	-15.43 KiB (0.03% reduction)	60.148 → 60.133 → 61.170
✅	docker_cws_instrumentation_amd64	neutral	2.994 MiB
✅	docker_cws_instrumentation_arm64	neutral	2.726 MiB
✅	docker_dogstatsd_amd64	neutral	15.017 MiB
✅	docker_dogstatsd_arm64	neutral	14.348 MiB
✅	dogstatsd_deb_amd64	neutral	7.937 MiB
✅	dogstatsd_deb_arm64	neutral	6.817 MiB
✅	dogstatsd_rpm_amd64	neutral	7.949 MiB
✅	dogstatsd_suse_amd64	neutral	7.949 MiB
✅	iot_agent_deb_amd64	neutral	11.265 MiB
✅	iot_agent_deb_arm64	-2.69 KiB (0.03% reduction)	9.633 → 9.630 → 10.450
✅	iot_agent_deb_armhf	neutral	9.827 MiB
✅	iot_agent_rpm_amd64	+2.41 KiB (0.02% increase)	11.278 → 11.281 → 12.060
✅	iot_agent_suse_amd64	+2.41 KiB (0.02% increase)	11.278 → 11.281 → 12.060

betterengineering

I think this is a good stab at isolating the injection mutation code. I think we have more work to do here, which can either happen in follow up PRs or in this change.

betterengineering · 2026-01-12T19:31:46Z

pkg/clusteragent/admission/mutate/autoinstrumentation/auto_instrumentation_test.go

+				containerNames: defaultContainerNames,
+			},
+		},
+		"custom library image via annotation is used": {


This test is a duplicate of:

local sdk injection with custom library image gets custom image

betterengineering · 2026-01-12T19:34:19Z

pkg/clusteragent/admission/mutate/autoinstrumentation/auto_instrumentation_test.go

+			},
+		},
+		// All supported languages tests
+		"all supported languages can be injected simultaneously": {


Suggested change

-		"all supported languages can be injected simultaneously": {
+		"all supported languages can be injected simultaneously through local SDK injection": {

betterengineering · 2026-01-12T19:37:05Z

pkg/clusteragent/admission/mutate/autoinstrumentation/auto_instrumentation_test.go

+				},
+				containerNames: defaultContainerNames,
+				expectedAnnotations: map[string]string{
+					"cluster-autoscaler.kubernetes.io/safe-to-evict-local-volumes": "datadog-auto-instrumentation,datadog-auto-instrumentation-etc",


We currently test for this in pkg/ssi/testutils during RequireInjection, but it doesn't hurt to add here:

datadog-agent/pkg/ssi/testutils/pod.go

Line 106 in 5964cf0

K8sAutoscalerSafeToEvictVolumesAnnotation: "datadog-auto-instrumentation,datadog-auto-instrumentation-etc",
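
For readers unfamiliar with this annotation: its value is a comma-separated list of volume names. The sketch below shows one way such a value could be built and checked in a test. The helper `markVolumesSafeToEvict` is hypothetical and is not the agent's implementation; only the annotation key and the two volume names come from the diff above.

```go
package main

import (
	"fmt"
	"strings"
)

// safeToEvictKey is the annotation key shown in the test expectation above.
const safeToEvictKey = "cluster-autoscaler.kubernetes.io/safe-to-evict-local-volumes"

// markVolumesSafeToEvict (hypothetical helper) appends volume names to the
// annotation, keeping existing entries and skipping duplicates.
func markVolumesSafeToEvict(annotations map[string]string, volumes ...string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	var names []string
	seen := map[string]bool{}
	add := func(name string) {
		name = strings.TrimSpace(name)
		if name == "" || seen[name] {
			return
		}
		seen[name] = true
		names = append(names, name)
	}
	// Existing entries first, so their order is preserved.
	for _, existing := range strings.Split(annotations[safeToEvictKey], ",") {
		add(existing)
	}
	for _, v := range volumes {
		add(v)
	}
	if len(names) > 0 {
		annotations[safeToEvictKey] = strings.Join(names, ",")
	}
	return annotations
}

func main() {
	got := markVolumesSafeToEvict(nil, "datadog-auto-instrumentation", "datadog-auto-instrumentation-etc")
	fmt.Println(got[safeToEvictKey])
}
```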

betterengineering · 2026-01-12T19:41:14Z

releasenotes-dca/notes/refactor-library-injection-provider-842fec2fd5ae6c72.yaml

+---
+other:
+  - |
+    APM: Refactor APM auto-instrumentation library injection to use a provider-based architecture.


I personally don't think this change needs a release note. I think it makes sense to reserve release notes for when there is a customer facing behavior change. You can use changelog/no-changelog to allow CI to be ok with it.

betterengineering · 2026-01-12T19:53:00Z

pkg/clusteragent/admission/mutate/autoinstrumentation/library_injection/annotation.go

Some thoughts:

Our module should be self-contained

My PR to refactor annotations moved everything into the autoinstrumentation module. The issue with shared usage we've had so far is that business logic ends up being implicit across webhooks and really hard to change. LabelSelectors was the other big one.

We should have submodules

I like that you've added one for library injection! If we want to reuse the annotation code, we should make another submodule or potentially leverage pkg/ssi and add a module there. I'm ok with small modules for now until we get it right. For example, make an autoinstrumentation/annotation module until we figure out where it belongs.

We shouldn't duplicate our code, even temporarily

I would much rather we have a somewhat useless module that only does annotations and can be imported everywhere than have two files for annotations, even temporarily.

betterengineering · 2026-01-12T20:00:36Z

pkg/clusteragent/admission/mutate/autoinstrumentation/namespace_mutator.go

 	workloadmeta "github.com/DataDog/datadog-agent/comp/core/workloadmeta/def"
 	"github.com/DataDog/datadog-agent/pkg/clusteragent/admission/common"
 	"github.com/DataDog/datadog-agent/pkg/clusteragent/admission/metrics"
+	libraryinjection "github.com/DataDog/datadog-agent/pkg/clusteragent/admission/mutate/autoinstrumentation/library_injection"


The name of this module should be tweaked. The simplest change is to remove the underscore from the module name. That way, we don't need the libraryinjection alias on the import.

But maybe you and I can workshop a more specific name? Maybe provider? The best name for the module will be the same as the core interface it provides.

betterengineering · 2026-01-12T20:03:41Z

pkg/clusteragent/admission/mutate/autoinstrumentation/library_injection/provider.go

+}
+
+// LibraryConfig contains the configuration needed to inject a language-specific tracing library.
+type LibraryConfig struct {


This looks good for your change, but we should standardize this piece. I propose the following structs: main...mark.spicer/refactor-lib-info

Let me create the new structs as a module before EOD. That way we can start to use them and are not stepping on each other's toes.

betterengineering · 2026-01-12T20:15:34Z

pkg/clusteragent/admission/mutate/autoinstrumentation/library_injection/provider.go

+// Different implementations can use different mechanisms:
+// - initContainerProvider: Uses init containers with EmptyDir volumes
+// - (future) CSI provider: Uses a CSI driver to mount library files
+type libraryInjectionProvider interface {


I love this interface! I think we could make it even simpler. Here would be my dream interface for this module:

type Provider interface { Inject(pod *corev1.Pod, libs []Library) Result }

To support this, the injector would need to be a library as well, which may get awkward. I also notice you have some other metadata that needs to be passed in. Maybe an expanded version:

type Provider interface { Inject(ctx InjectContext, pod *corev1.Pod, injector Injector, libs []Library) Result }

But I think I want this module to have one method that does everything I need. I will always need the injector.
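
As a rough illustration of the expanded single-method shape, the sketch below fills in the types the comment leaves open. Everything beyond the `Inject` signature is an assumption: `Pod` stands in for `corev1.Pod`, and the fields of `InjectContext`, `Injector`, `Library`, and `Result`, as well as the container names, are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a simplified stand-in for corev1.Pod (assumption).
type Pod struct{ InitContainers []string }

// The types below are hypothetical; the comment only names them.
type InjectContext struct{ Namespace string }
type Injector struct{ Image string }
type Library struct{ Language, Image string }

// Result reports what was injected, or why injection failed.
type Result struct {
	Injected []string
	Err      error
}

// Provider is the single-method interface proposed in the review.
type Provider interface {
	Inject(ctx InjectContext, pod *Pod, injector Injector, libs []Library) Result
}

// initContainerProvider is an illustrative implementation of Provider.
type initContainerProvider struct{}

func (initContainerProvider) Inject(ctx InjectContext, pod *Pod, injector Injector, libs []Library) Result {
	// The injector is always required, so validate it before touching the pod.
	if injector.Image == "" {
		return Result{Err: errors.New("injector image is empty")}
	}
	pod.InitContainers = append(pod.InitContainers, "init-injector") // hypothetical name
	res := Result{Injected: []string{"injector"}}
	for _, lib := range libs {
		pod.InitContainers = append(pod.InitContainers, "init-lib-"+lib.Language) // hypothetical name
		res.Injected = append(res.Injected, lib.Language)
	}
	return res
}

func main() {
	var p Provider = initContainerProvider{}
	pod := &Pod{}
	res := p.Inject(InjectContext{Namespace: "default"}, pod,
		Injector{Image: "example.com/injector:v1"},
		[]Library{{Language: "python", Image: "example.com/python-lib:v1"}})
	fmt.Println(res.Injected, res.Err)
}
```

With one entry point, callers cannot forget the injector step, which is the property the comment is asking for.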

betterengineering · 2026-01-12T20:17:41Z

pkg/clusteragent/admission/mutate/autoinstrumentation/library_injection/init_container_test.go

+
+//go:build kubeapiserver
+
+package libraryinjection


I really like the _test pattern. It's not just pedantic. I think it forces us to write good tests and think about the public interfaces we are creating. One issue we had before was that, because the tests had access to everything, we didn't create good interfaces and tested the code as is.

iamluc added 2 commits January 12, 2026 12:19

refactor(ssi): introduce libraryinjection provider for APM injection

87f766f

Add tests

3f01216

iamluc force-pushed the luc/ssi-refacto-library-injection branch from cc0cdcc to 3f01216 Compare January 12, 2026 11:19

github-actions bot added the long review ("PR is complex, plan time to review it"), team/container-platform ("The Container Platform Team"), and team/injection-platform labels Jan 12, 2026

iamluc added 3 commits January 12, 2026 14:04

Cleanup dead code

c633212

Move functions to their right locations

fb85cdc

Add release note

007ddd1

iamluc force-pushed the luc/ssi-refacto-library-injection branch from ba41a0c to 007ddd1 Compare January 12, 2026 13:05

iamluc added the qa/done label ("QA done before merge and regressions are covered by tests") Jan 12, 2026

iamluc commented Jan 12, 2026


iamluc marked this pull request as ready for review January 12, 2026 13:54

iamluc requested review from a team as code owners January 12, 2026 13:54

rtrieu approved these changes Jan 12, 2026


betterengineering approved these changes Jan 12, 2026
