[BUG] postgres DBM: monotonic memory growth in cluster checks runner (agent 7.78.0) leading to OOMKill every 3-4h per pod #50270

@alexnick-glow

Description

Agent Environment

  • Agent: 7.78.0
  • Datadog Helm chart: datadog/datadog 3.201.6
  • Deployment role: clusterChecksRunner (DD_CLC_RUNNER_ENABLED=true)
  • Replicas: 3
  • Resources: requests.memory: 2Gi, limits.memory: 4Gi
  • Platform: AWS EKS 1.31, x86_64, cgroup v2
  • Target: AWS RDS for PostgreSQL 18.1, 1 primary + 2 read replicas, IAM auth (IRSA)

What happened

The agent container in the datadog-clusterchecks Deployment is OOMKilled by the kernel roughly every 3–4 hours per pod, time-correlated across the 3 replicas after each cluster-check rebalance.

container.memory.working_set rises monotonically from container startup until it hits the cgroup limit. There is no plateau.
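
For anyone cross-checking the graph from inside a pod, here is a minimal sketch of how a working-set figure can be derived on cgroup v2. It assumes the metric follows the usual kubelet/cAdvisor definition (memory.current minus inactive_file from memory.stat) and that the container's cgroup is mounted at /sys/fs/cgroup; the function names are ours, not part of the Agent.

```python
from __future__ import annotations

from pathlib import Path


def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse cgroup v2 memory.stat content ("key value" per line) into a dict."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(memory_current: int, stat: dict[str, int]) -> int:
    """Usage minus inactive file cache, floored at zero."""
    return max(memory_current - stat.get("inactive_file", 0), 0)


def read_working_set(cgroup_dir: str = "/sys/fs/cgroup") -> int:
    """Read the working set of the current cgroup (the path is an assumption)."""
    root = Path(cgroup_dir)
    current = int((root / "memory.current").read_text().strip())
    stat = parse_memory_stat((root / "memory.stat").read_text())
    return working_set_bytes(current, stat)
```

Calling read_working_set() periodically inside the agent container should track the same curve as the dashboard metric, if the assumption about the definition holds.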

The most informative signal: memory keeps growing on a CCR pod even after the cluster agent rebalances the primary postgres instance check off it. RSS does not drop when a check leaves a pod, which points at retention rather than sizing.

We previously raised limits.memory from 3Gi → 4Gi. It only delayed the OOM; the slope is unchanged.
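
The arithmetic behind "only delayed" can be sketched as follows: with a constant slope, raising the limit by some amount buys exactly that amount divided by the slope in extra time. The sketch fits a least-squares line to (seconds, bytes) samples and projects the time left until a limit. The sample values in the usage note are made up for illustration and are not our measurements.

```python
from __future__ import annotations


def fit_slope(samples: list[tuple[float, float]]) -> tuple[float, float]:
    """Least-squares fit over (t_seconds, bytes) samples.

    Returns (slope in bytes/second, intercept in bytes).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    slope = cov / var_t
    return slope, mean_y - slope * mean_t


def seconds_until_limit(
    samples: list[tuple[float, float]], limit_bytes: float
) -> float | None:
    """Projected seconds from the last sample until the fitted line hits the limit.

    Returns None when memory is flat or shrinking (no projected OOM).
    """
    slope, intercept = fit_slope(samples)
    if slope <= 0:
        return None
    fitted_now = slope * samples[-1][0] + intercept
    return max((limit_bytes - fitted_now) / slope, 0.0)
```

For example, with hypothetical samples growing 1 GiB per hour and currently at 3 GiB, a 4 GiB limit is reached about 3600 seconds later than a 3 GiB limit would be. The slope itself is untouched by the limit change.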

What we expected

Steady-state RSS proportional to currently-assigned scope, with state released when a check instance is unscheduled from a runner during rebalance.
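
To make the expectation concrete, here is a toy model of the behavior we would expect. It is purely hypothetical and is not the Agent's or the postgres check's code: per-instance state lives under the instance's key and is dropped as a whole when the instance is unscheduled.

```python
from __future__ import annotations


class RunnerState:
    """Hypothetical per-runner registry of check-instance state (illustration only)."""

    def __init__(self) -> None:
        self._caches: dict[str, dict[str, object]] = {}

    def schedule(self, instance_id: str) -> None:
        """Start tracking state for an instance assigned to this runner."""
        self._caches.setdefault(instance_id, {})

    def record(self, instance_id: str, key: str, value: object) -> None:
        """Store one piece of per-instance state (e.g. relation metadata)."""
        self._caches[instance_id][key] = value

    def unschedule(self, instance_id: str) -> None:
        """Drop everything held for an instance that left this runner."""
        self._caches.pop(instance_id, None)

    def retained_entries(self) -> int:
        """Total number of state entries currently held across instances."""
        return sum(len(cache) for cache in self._caches.values())
```

In this model, unscheduling the primary instance releases all of its entries, so retained state tracks the currently assigned scope. What we observe looks like the opposite: memory does not follow the assignment.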

Configuration

Postgres DBM cluster check config (postgres.yaml mounted via the cluster agent's confd):

cluster_check: true
init_config:
instances:
  - dbm: true
    host: <primary>
    port: 5432
    username: datadog
    ssl: require
    empty_default_hostname: false
    exclude_hostname: true
    aws:
      managed_authentication:
        enabled: true
    database_autodiscovery:
      enabled: true
      include:
        - <15 application databases>
    relations:
      - relation_regex: .*
    tags: [..., role:primary]

  - dbm: true
    host: <replica-1>
    # ...same as primary except no `relations` block
    tags: [..., role:replica]

  - dbm: true
    host: <replica-2>
    # ...same as primary except no `relations` block
    tags: [..., role:replica]

  • Resulting scope on the primary: ~1,330 relations across 15 databases
  • No max_relations, no custom min_collection_interval, and no query_samples overrides

What we ruled out

  • Connection growth: pg_stat_activity shows a stable count of datadog sessions per CCR pod over the leak window.
  • Query samples spike: query_samples is on defaults; datadog.postgres.* ingestion rate is steady.
  • Slowing checks: per-instance datadog.postgres.collection.time is stable across the leak window.
  • Known fixes: searched the integrations-core CHANGELOG and open issues for recent postgres memory fixes; couldn't find one that matches this rebalance-retention signature.

Additional info available on request

  • agent status and agent configcheck from a CCR pod
  • 7d container.memory.working_set graph (sawtooth from probe restarts; previously cliffs from OOM)
  • Full rendered postgres.yaml
  • tracemalloc / Python memory-tracking snapshot taken just before the probe trips on a single pod
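
For reference, a minimal sketch of how two tracemalloc snapshots can be compared to find the source lines whose allocations grew the most. It uses only the standard-library tracemalloc API; the helper names are ours, and how the snapshots are captured inside the Agent's embedded Python is left out here.

```python
from __future__ import annotations

import tracemalloc


def top_growth(
    before: tracemalloc.Snapshot, after: tracemalloc.Snapshot, limit: int = 10
) -> list[tracemalloc.StatisticDiff]:
    """Return the largest positive allocation deltas, grouped by source line."""
    diffs = after.compare_to(before, "lineno")
    growing = [d for d in diffs if d.size_diff > 0]
    growing.sort(key=lambda d: d.size_diff, reverse=True)
    return growing[:limit]


def format_growth(diffs: list[tracemalloc.StatisticDiff]) -> list[str]:
    """Render each delta as 'file:line +bytes (+blocks)'."""
    lines = []
    for d in diffs:
        frame = d.traceback[0]  # "lineno" grouping keeps a single frame
        lines.append(
            f"{frame.filename}:{frame.lineno} "
            f"+{d.size_diff} B ({d.count_diff:+d} blocks)"
        )
    return lines
```

Taking one snapshot shortly after startup and another just before the restart, then passing both to top_growth, should show which allocation sites account for the growth rather than only the total.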
