Agent Environment
- Agent: 7.78.0
- Datadog Helm chart: datadog/datadog 3.201.6
- Deployment role: clusterChecksRunner (DD_CLC_RUNNER_ENABLED=true)
- Replicas: 3
- Resources: requests.memory: 2Gi, limits.memory: 4Gi
- Platform: AWS EKS 1.31, x86_64, cgroup v2
- Target: AWS RDS for PostgreSQL 18.1, 1 primary + 2 read replicas, IAM auth (IRSA)
What happened
The agent container in the datadog-clusterchecks Deployment is OOMKilled by the kernel roughly every 3–4 hours per pod, time-correlated across the 3 replicas after each cluster-check rebalance.
container.memory.working_set rises monotonically from container startup until it hits the cgroup limit. There is no plateau.
The most informative signal: memory keeps growing on a CCR pod even after the cluster agent rebalances the primary postgres instance check off of it. RSS does not drop when a check leaves a pod. This is what points at retention rather than sizing.
We previously raised limits.memory from 3Gi to 4Gi. That only delayed the OOM; the growth slope is unchanged.
What we expected
Steady-state RSS proportional to currently-assigned scope, with state released when a check instance is unscheduled from a runner during rebalance.
Configuration
Postgres DBM cluster check config (postgres.yaml mounted via the cluster agent's confd):
    cluster_check: true
    init_config:
    instances:
      - dbm: true
        host: <primary>
        port: 5432
        username: datadog
        ssl: require
        empty_default_hostname: false
        exclude_hostname: true
        aws:
          managed_authentication:
            enabled: true
        database_autodiscovery:
          enabled: true
          include:
            - <15 application databases>
        relations:
          - relation_regex: .*
        tags: [..., role:primary]
      - dbm: true
        host: <replica-1>
        # ...same as primary except no `relations` block
        tags: [..., role:replica]
      - dbm: true
        host: <replica-2>
        # ...same as primary except no `relations` block
        tags: [..., role:replica]
- Resulting scope on the primary: ~1,330 relations across 15 databases
- No max_relations, no custom min_collection_interval, no query_samples overrides
What we ruled out
- Connection growth: pg_stat_activity shows a stable count of datadog sessions per CCR pod over the leak window.
- Query samples spike: query_samples is on defaults; the datadog.postgres.* ingestion rate is steady.
- Slowing checks: per-instance datadog.postgres.collection.time is stable across the leak window.
- Searched the integrations-core CHANGELOG and open issues for recent postgres memory fixes; couldn't find one that matches this rebalance-retention signature.
Additional info available on request
- agent status and agent configcheck from a CCR pod
- 7d container.memory.working_set graph (sawtooth from probe restarts; previously cliffs from OOM)
- Full rendered postgres.yaml
- tracemalloc / Python memory-tracking snapshot taken just before the probe trips on a single pod
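
For reference, the kind of snapshot comparison we have in mind is sketched below. It is a generic standard-library illustration, not the agent's own memory-profiling tooling. `top_growth` is a hypothetical helper, and the `retained` list is a simulated stand-in for state that a check might hold on to.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the allocation sites that grew the most between two snapshots."""
    # compare_to() returns StatisticDiff entries ordered by largest change first.
    stats = after.compare_to(before, "lineno")
    grown = [s for s in stats if s.size_diff > 0]
    return grown[:limit]


tracemalloc.start(10)  # record up to 10 frames per allocation
before = tracemalloc.take_snapshot()

# Simulated retention (hypothetical): objects appended and never released.
retained = []
for _ in range(1000):
    retained.append(bytes(1024))

after = tracemalloc.take_snapshot()
growth = top_growth(before, after)
for stat in growth:
    print(stat)  # file:line, size delta, and allocation count delta

tracemalloc.stop()
```

Comparing a snapshot taken shortly after startup with one taken just before the probe trips should show which allocation sites account for the growth.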