Conversation

@iamluc
Contributor

@iamluc iamluc commented Jan 12, 2026

What does this PR do?

Introduces a new library_injection package with a provider-based architecture for APM library injection.
This refactoring extracts the injection logic into:

  • InjectAPMLibraries high-level function orchestrating the injection flow
  • libraryInjectionProvider interface with InjectInjector and InjectLibrary methods
  • initContainerProvider implementation using init containers and EmptyDir volumes (current implementation)

The apmInjectionMutator function in namespace_mutator.go is simplified to delegate to libraryinjection.InjectAPMLibraries.
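The flow described above could be sketched roughly as follows. This is a minimal, hypothetical illustration and not the code from this PR: the `Pod` type is a simplified stand-in for `corev1.Pod`, and the method signatures, the `LibraryConfig` fields, and the image names are all assumptions. Only the names `InjectAPMLibraries`, `libraryInjectionProvider`, `InjectInjector`, `InjectLibrary`, and `initContainerProvider` come from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a simplified stand-in for corev1.Pod (assumption for this sketch).
type Pod struct {
	Name           string
	InitContainers []string
	Volumes        []string
}

// LibraryConfig describes one language library to inject (hypothetical fields).
type LibraryConfig struct {
	Language string
	Image    string
}

// libraryInjectionProvider uses the two method names from the PR description;
// the signatures are assumed.
type libraryInjectionProvider interface {
	InjectInjector(pod *Pod, injectorImage string) error
	InjectLibrary(pod *Pod, lib LibraryConfig) error
}

// sharedVolume is the EmptyDir volume name used in this sketch.
const sharedVolume = "datadog-auto-instrumentation"

// initContainerProvider records init containers and a shared volume on the pod.
type initContainerProvider struct{}

func (initContainerProvider) InjectInjector(pod *Pod, injectorImage string) error {
	if injectorImage == "" {
		return errors.New("injector image is required")
	}
	pod.Volumes = appendUnique(pod.Volumes, sharedVolume)
	pod.InitContainers = append(pod.InitContainers, "injector:"+injectorImage)
	return nil
}

func (initContainerProvider) InjectLibrary(pod *Pod, lib LibraryConfig) error {
	if lib.Image == "" {
		return fmt.Errorf("no image for language %q", lib.Language)
	}
	pod.InitContainers = append(pod.InitContainers, lib.Language+":"+lib.Image)
	return nil
}

// appendUnique appends item only if it is not already present.
func appendUnique(items []string, item string) []string {
	for _, existing := range items {
		if existing == item {
			return items
		}
	}
	return append(items, item)
}

// InjectAPMLibraries orchestrates the flow: injector first, then each library.
func InjectAPMLibraries(p libraryInjectionProvider, pod *Pod, injectorImage string, libs []LibraryConfig) error {
	if err := p.InjectInjector(pod, injectorImage); err != nil {
		return fmt.Errorf("inject injector: %w", err)
	}
	for _, lib := range libs {
		if err := p.InjectLibrary(pod, lib); err != nil {
			return fmt.Errorf("inject %s library: %w", lib.Language, err)
		}
	}
	return nil
}

func main() {
	pod := &Pod{Name: "demo"}
	libs := []LibraryConfig{{Language: "java", Image: "example/java-lib:v1"}}
	if err := InjectAPMLibraries(initContainerProvider{}, pod, "example/injector:v1", libs); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(pod.InitContainers, pod.Volumes)
}
```

Because the orchestrator only depends on the interface, a different provider could be passed in without touching `InjectAPMLibraries`.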

Review Strategy

We recommend reviewing this PR commit by commit for easier understanding of the changes.

Commits Overview
  1. refactor(ssi): introduce libraryinjection provider for APM injection
    Core refactoring that introduces the new library_injection package with a provider-based architecture. This abstracts the injection mechanism behind an interface (libraryInjectionProvider) with an initial implementation using init containers (initContainerProvider).

  2. Add tests
    Adds unit tests for the library_injection package (pod_patcher_test.go, init_container_test.go) and additional integration tests in auto_instrumentation_test.go to ensure compatibility with the existing behavior.

  3. Cleanup dead code
    Removes legacy code that is no longer used after the refactoring: injector.go, lib_requirement.go, unused methods in language_versions.go, and various helper functions.

  4. Move functions to their right locations
    Improves code organization by moving functions to the files where they are actually used (e.g., getNamespaceLabels and extractLibrariesFromAnnotations to target_mutator.go).

  5. Add release note
    Adds the release note for the Cluster Agent changelog.

Motivation

This refactoring prepares the codebase for upcoming alternative injection modes, such as CSI driver-based injection. By abstracting the injection mechanism behind a provider interface, new injection strategies can be added without modifying the core mutation logic.

Describe how you validated your changes

  • All existing tests pass: go test -tags "kubeapiserver test" ./pkg/clusteragent/admission/mutate/autoinstrumentation/...
  • Added new tests to validate image configuration, annotation handling, and environment variable injection

Additional Notes

@iamluc iamluc force-pushed the luc/ssi-refacto-library-injection branch from cc0cdcc to 3f01216 Compare January 12, 2026 11:19
@github-actions github-actions bot added the long review, team/container-platform, and team/injection-platform labels Jan 12, 2026
@agent-platform-auto-pr
Contributor

agent-platform-auto-pr bot commented Jan 12, 2026

Go Package Import Differences

Baseline: 340919a
Comparison: 007ddd1

binary os arch change
cluster-agent linux amd64 +1, -0
+github.com/DataDog/datadog-agent/pkg/clusteragent/admission/mutate/autoinstrumentation/library_injection
cluster-agent linux arm64 +1, -0
+github.com/DataDog/datadog-agent/pkg/clusteragent/admission/mutate/autoinstrumentation/library_injection

@cit-pr-commenter

cit-pr-commenter bot commented Jan 12, 2026

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 10653287-2120-4a4c-a047-4caa4c6e1c4b

Baseline: 340919a
Comparison: 007ddd1
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI trials links
docker_containers_cpu % cpu utilization +1.90 [-1.15, +4.95] 1 Logs

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
docker_containers_cpu % cpu utilization +1.90 [-1.15, +4.95] 1 Logs
tcp_syslog_to_blackhole ingress throughput +0.78 [+0.73, +0.84] 1 Logs
quality_gate_idle_all_features memory utilization +0.39 [+0.36, +0.43] 1 Logs bounds checks dashboard
ddot_metrics_sum_cumulativetodelta_exporter memory utilization +0.37 [+0.14, +0.60] 1 Logs
quality_gate_idle memory utilization +0.24 [+0.20, +0.29] 1 Logs bounds checks dashboard
quality_gate_metrics_logs memory utilization +0.17 [-0.04, +0.38] 1 Logs bounds checks dashboard
docker_containers_memory memory utilization +0.13 [+0.05, +0.20] 1 Logs
ddot_metrics memory utilization +0.11 [-0.12, +0.34] 1 Logs
otlp_ingest_logs memory utilization +0.08 [-0.02, +0.17] 1 Logs
file_to_blackhole_500ms_latency egress throughput +0.07 [-0.30, +0.45] 1 Logs
file_tree memory utilization +0.05 [+0.00, +0.10] 1 Logs
file_to_blackhole_100ms_latency egress throughput +0.04 [-0.00, +0.09] 1 Logs
file_to_blackhole_0ms_latency egress throughput +0.03 [-0.35, +0.42] 1 Logs
uds_dogstatsd_to_api_v3 ingress throughput +0.02 [-0.11, +0.14] 1 Logs
uds_dogstatsd_to_api ingress throughput +0.01 [-0.10, +0.13] 1 Logs
tcp_dd_logs_filter_exclude ingress throughput -0.00 [-0.08, +0.07] 1 Logs
file_to_blackhole_1000ms_latency egress throughput -0.03 [-0.43, +0.37] 1 Logs
ddot_metrics_sum_cumulative memory utilization -0.14 [-0.30, +0.02] 1 Logs
quality_gate_logs % cpu utilization -0.14 [-1.61, +1.32] 1 Logs bounds checks dashboard
uds_dogstatsd_20mb_12k_contexts_20_senders memory utilization -0.28 [-0.33, -0.22] 1 Logs
ddot_metrics_sum_delta memory utilization -0.30 [-0.50, -0.10] 1 Logs
otlp_ingest_metrics memory utilization -0.49 [-0.65, -0.34] 1 Logs
ddot_logs memory utilization -0.78 [-0.84, -0.72] 1 Logs

Bounds Checks: ✅ Passed

perf experiment bounds_check_name replicates_passed links
docker_containers_cpu simple_check_run 10/10
docker_containers_memory memory_usage 10/10
docker_containers_memory simple_check_run 10/10
file_to_blackhole_0ms_latency lost_bytes 10/10
file_to_blackhole_0ms_latency memory_usage 10/10
file_to_blackhole_1000ms_latency lost_bytes 10/10
file_to_blackhole_1000ms_latency memory_usage 10/10
file_to_blackhole_100ms_latency lost_bytes 10/10
file_to_blackhole_100ms_latency memory_usage 10/10
file_to_blackhole_500ms_latency lost_bytes 10/10
file_to_blackhole_500ms_latency memory_usage 10/10
quality_gate_idle intake_connections 10/10 bounds checks dashboard
quality_gate_idle memory_usage 10/10 bounds checks dashboard
quality_gate_idle_all_features intake_connections 10/10 bounds checks dashboard
quality_gate_idle_all_features memory_usage 10/10 bounds checks dashboard
quality_gate_logs intake_connections 10/10 bounds checks dashboard
quality_gate_logs lost_bytes 10/10 bounds checks dashboard
quality_gate_logs memory_usage 10/10 bounds checks dashboard
quality_gate_metrics_logs cpu_usage 10/10 bounds checks dashboard
quality_gate_metrics_logs intake_connections 10/10 bounds checks dashboard
quality_gate_metrics_logs lost_bytes 10/10 bounds checks dashboard
quality_gate_metrics_logs memory_usage 10/10 bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
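The three criteria above can be expressed as a small predicate. The following is a sketch only: the `Experiment` type and its field names are hypothetical, and this is not the Regression Detector's actual code. The 5.00% tolerance and the sample rows are taken from the report above; `docker_containers_cpu` is treated as erratic because it is listed under the experiments ignored for regressions.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment holds the values reported per table row (illustrative field names).
type Experiment struct {
	Name          string
	DeltaMeanPct  float64 // "Δ mean %"
	CILow, CIHigh float64 // bounds of "Δ mean % CI"
	Erratic       bool    // true when the configuration sets erratic: true
}

// effectSizeTolerance is the |Δ mean %| threshold stated in the report.
const effectSizeTolerance = 5.0

// isRegression applies the three criteria: large enough effect,
// confidence interval excluding zero, and not marked erratic.
func isRegression(e Experiment) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= effectSizeTolerance
	ciExcludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && ciExcludesZero && !e.Erratic
}

func main() {
	rows := []Experiment{
		{Name: "docker_containers_cpu", DeltaMeanPct: 1.90, CILow: -1.15, CIHigh: 4.95, Erratic: true},
		{Name: "tcp_syslog_to_blackhole", DeltaMeanPct: 0.78, CILow: 0.73, CIHigh: 0.84},
	}
	for _, r := range rows {
		fmt.Printf("%s regression=%t\n", r.Name, isRegression(r))
	}
}
```

Neither sample row is flagged: the first has a confidence interval containing zero (and is erratic), and the second has an interval that excludes zero but an effect size well under the tolerance. The predicate ignores direction, so deciding whether a flagged change is better or worse for a given optimization goal would need an extra step.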

CI Pass/Fail Decision

Passed. All Quality Gates passed.

  • quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.

@iamluc iamluc force-pushed the luc/ssi-refacto-library-injection branch from ba41a0c to 007ddd1 Compare January 12, 2026 13:05
@iamluc iamluc added the qa/done label Jan 12, 2026
Contributor Author

@betterengineering I think we should move mutate/autoinstrumentation/annotation.go to mutate/common/annotation.go so we can reuse it in library_injection.

But I didn't want to do it in this PR to keep the review simpler.

WDYT?

Member

Some thoughts:

  • Our module should be self-contained
    • My PR to refactor annotations moved everything into the autoinstrumentation module. The issue with shared usage we've had so far is that business logic ends up being implicit across webhooks and really hard to change. LabelSelectors was the other big one.
  • We should have submodules
    • I like that you've added one for library injection! If we want to reuse the annotation code, we should make another submodule or potentially leverage pkg/ssi and add a module there. I'm ok with small modules for now until we get it right. For example, make an autoinstrumentation/annotation module until we figure out where it belongs.
  • We shouldn't duplicate our code, even temporarily
    • I would much rather we have a somewhat useless module that only does annotations and can be imported everywhere than have two files for annotations, even temporarily.

@iamluc iamluc marked this pull request as ready for review January 12, 2026 13:54
@iamluc iamluc requested review from a team as code owners January 12, 2026 13:54
@agent-platform-auto-pr
Contributor

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 340919a
📊 Static Quality Gates Dashboard

Successful checks

Info

Quality gate Change Size (prev → curr → max)
docker_cluster_agent_amd64 -32.59 KiB (0.02% reduction) 180.761 → 180.729 → 181.080
docker_cluster_agent_arm64 -64.61 KiB (0.03% reduction) 196.618 → 196.555 → 198.490
29 successful checks with minimal change (< 2 KiB)
Quality gate Current Size
agent_deb_amd64 705.841 MiB
agent_deb_amd64_fips 701.138 MiB
agent_heroku_amd64 326.864 MiB
agent_msi 571.327 MiB
agent_rpm_amd64 705.828 MiB
agent_rpm_amd64_fips 701.125 MiB
agent_rpm_arm64 687.324 MiB
agent_rpm_arm64_fips 683.465 MiB
agent_suse_amd64 705.828 MiB
agent_suse_amd64_fips 701.125 MiB
agent_suse_arm64 687.324 MiB
agent_suse_arm64_fips 683.465 MiB
docker_agent_amd64 767.575 MiB
docker_agent_arm64 773.692 MiB
docker_agent_jmx_amd64 958.454 MiB
docker_agent_jmx_arm64 953.290 MiB
docker_cws_instrumentation_amd64 7.135 MiB
docker_cws_instrumentation_arm64 6.689 MiB
docker_dogstatsd_amd64 38.785 MiB
docker_dogstatsd_arm64 37.128 MiB
dogstatsd_deb_amd64 30.004 MiB
dogstatsd_deb_arm64 28.156 MiB
dogstatsd_rpm_amd64 30.004 MiB
dogstatsd_suse_amd64 30.004 MiB
iot_agent_deb_amd64 43.002 MiB
iot_agent_deb_arm64 40.123 MiB
iot_agent_deb_armhf 40.704 MiB
iot_agent_rpm_amd64 43.003 MiB
iot_agent_suse_amd64 43.003 MiB
On-wire sizes (compressed)
Quality gate Change Size (prev → curr → max)
agent_deb_amd64 -16.73 KiB (0.01% reduction) 173.537 → 173.521 → 174.490
agent_deb_amd64_fips -40.62 KiB (0.02% reduction) 172.422 → 172.383 → 173.750
agent_heroku_amd64 +2.7 KiB (0.00% increase) 87.108 → 87.110 → 88.450
agent_msi +16.0 KiB (0.01% increase) 142.863 → 142.879 → 143.020
agent_rpm_amd64 -45.27 KiB (0.03% reduction) 176.140 → 176.096 → 177.660
agent_rpm_amd64_fips -7.57 KiB (0.00% reduction) 175.349 → 175.342 → 176.600
agent_rpm_arm64 +7.09 KiB (0.00% increase) 159.559 → 159.566 → 161.260
agent_rpm_arm64_fips +6.86 KiB (0.00% increase) 158.910 → 158.917 → 160.550
agent_suse_amd64 -45.27 KiB (0.03% reduction) 176.140 → 176.096 → 177.660
agent_suse_amd64_fips -7.57 KiB (0.00% reduction) 175.349 → 175.342 → 176.600
agent_suse_arm64 +7.09 KiB (0.00% increase) 159.559 → 159.566 → 161.260
agent_suse_arm64_fips +6.86 KiB (0.00% increase) 158.910 → 158.917 → 160.550
docker_agent_amd64 +2.67 KiB (0.00% increase) 261.164 → 261.166 → 262.450
docker_agent_arm64 neutral 250.219 MiB
docker_agent_jmx_amd64 neutral 329.802 MiB
docker_agent_jmx_arm64 neutral 314.837 MiB
docker_cluster_agent_amd64 +5.59 KiB (0.01% increase) 63.849 → 63.855 → 64.490
docker_cluster_agent_arm64 -15.43 KiB (0.03% reduction) 60.148 → 60.133 → 61.170
docker_cws_instrumentation_amd64 neutral 2.994 MiB
docker_cws_instrumentation_arm64 neutral 2.726 MiB
docker_dogstatsd_amd64 neutral 15.017 MiB
docker_dogstatsd_arm64 neutral 14.348 MiB
dogstatsd_deb_amd64 neutral 7.937 MiB
dogstatsd_deb_arm64 neutral 6.817 MiB
dogstatsd_rpm_amd64 neutral 7.949 MiB
dogstatsd_suse_amd64 neutral 7.949 MiB
iot_agent_deb_amd64 neutral 11.265 MiB
iot_agent_deb_arm64 -2.69 KiB (0.03% reduction) 9.633 → 9.630 → 10.450
iot_agent_deb_armhf neutral 9.827 MiB
iot_agent_rpm_amd64 +2.41 KiB (0.02% increase) 11.278 → 11.281 → 12.060
iot_agent_suse_amd64 +2.41 KiB (0.02% increase) 11.278 → 11.281 → 12.060

Member

@betterengineering betterengineering left a comment

I think this is a good stab at isolating the injection mutation code. I think we have more work to do here, which can happen either in follow-up PRs or in this change.

containerNames: defaultContainerNames,
},
},
"custom library image via annotation is used": {
Member

This test is a duplicate of:

local sdk injection with custom library image gets custom image

},
},
// All supported languages tests
"all supported languages can be injected simultaneously": {
Member

Suggested change
- "all supported languages can be injected simultaneously": {
+ "all supported languages can be injected simultaneously through local SDK injection": {

},
containerNames: defaultContainerNames,
expectedAnnotations: map[string]string{
"cluster-autoscaler.kubernetes.io/safe-to-evict-local-volumes": "datadog-auto-instrumentation,datadog-auto-instrumentation-etc",
Member

We currently test for this in pkg/ssi/testutils during RequireInjection, but it doesn't hurt to add here:

K8sAutoscalerSafeToEvictVolumesAnnotation: "datadog-auto-instrumentation,datadog-auto-instrumentation-etc",

---
other:
- |
APM: Refactor APM auto-instrumentation library injection to use a provider-based architecture.
Member

I personally don't think this change needs a release note. I think it makes sense to reserve release notes for when there is a customer-facing behavior change. You can use changelog/no-changelog so that CI accepts the PR without one.

workloadmeta "github.com/DataDog/datadog-agent/comp/core/workloadmeta/def"
"github.com/DataDog/datadog-agent/pkg/clusteragent/admission/common"
"github.com/DataDog/datadog-agent/pkg/clusteragent/admission/metrics"
libraryinjection "github.com/DataDog/datadog-agent/pkg/clusteragent/admission/mutate/autoinstrumentation/library_injection"
Member

The name of this module should be tweaked. The simplest change is to remove the underscore from the module name. That way, we don't need the libraryinjection alias on the import.

But maybe you and I can workshop a more specific name? Maybe provider? The best name for the module will be the same as the core interface it provides.

}

// LibraryConfig contains the configuration needed to inject a language-specific tracing library.
type LibraryConfig struct {
Member

This looks good for your change, but we should standardize this piece. I propose the following structs: main...mark.spicer/refactor-lib-info

Let me create the new structs as a module before EOD. That way we can start to use them and are not stepping on each other's toes.

// Different implementations can use different mechanisms:
// - initContainerProvider: Uses init containers with EmptyDir volumes
// - (future) CSI provider: Uses a CSI driver to mount library files
type libraryInjectionProvider interface {
Member

I love this interface! I think we could make it even simpler. Here would be my dream interface for this module:

type Provider interface {
	Inject(pod *corev1.Pod, libs []Library) Result
}

To support this, the injector would need to be a library as well, which may get awkward. I also notice you have some other metadata that needs to be passed in. Maybe an expanded version:

type Provider interface {
	Inject(ctx InjectContext, pod *corev1.Pod, injector Injector, libs []Library) Result
}

But I think I want this module to have one method that does everything I need. I will always need the injector.
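To make the single-method idea concrete, here is a rough, self-contained sketch of what an implementation of the expanded interface might look like. Everything beyond the `Provider` interface shape is assumed for illustration: `Pod` is a simplified stand-in for `corev1.Pod`, and the fields of `InjectContext`, `Injector`, `Library`, and `Result` are hypothetical.

```go
package main

import "fmt"

// Pod is a simplified stand-in for corev1.Pod (assumption for this sketch).
type Pod struct{ InitContainers []string }

// The following types are hypothetical; only their names appear in the proposal.
type InjectContext struct{ Namespace string }
type Injector struct{ Image string }
type Library struct{ Language, Image string }
type Result struct {
	Injected []string // languages that were injected
	Err      error
}

// Provider is the single-method interface from the expanded proposal.
type Provider interface {
	Inject(ctx InjectContext, pod *Pod, injector Injector, libs []Library) Result
}

// initContainerProvider is one possible implementation: it records the
// injector and each library image as init containers on the pod.
type initContainerProvider struct{}

func (initContainerProvider) Inject(ctx InjectContext, pod *Pod, injector Injector, libs []Library) Result {
	if injector.Image == "" {
		return Result{Err: fmt.Errorf("namespace %q: injector image is required", ctx.Namespace)}
	}
	pod.InitContainers = append(pod.InitContainers, injector.Image)

	var res Result
	for _, lib := range libs {
		pod.InitContainers = append(pod.InitContainers, lib.Image)
		res.Injected = append(res.Injected, lib.Language)
	}
	return res
}

func main() {
	var p Provider = initContainerProvider{}
	pod := &Pod{}
	res := p.Inject(
		InjectContext{Namespace: "default"},
		pod,
		Injector{Image: "example/injector:v1"},
		[]Library{{Language: "java", Image: "example/java-lib:v1"}},
	)
	fmt.Println(res.Injected, res.Err, pod.InitContainers)
}
```

With one method, callers never have to remember to inject the injector before the libraries; the ordering lives inside the provider.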


//go:build kubeapiserver

package libraryinjection
Member

I really like the _test pattern. It's not just pedantic. I think it forces us to write good tests and think about the public interfaces we are creating. One issue we had before was that, because the tests had access to everything, we didn't create good interfaces and we tested the code as is.

Labels

long review, qa/done, team/container-platform, team/injection-platform

4 participants