[BUG] postgres DBM: monotonic memory growth in cluster checks runner (agent 7.78.0) leading to OOMKill every 3-4h per pod

### Agent Environment

- Agent: `7.78.0`
- Datadog Helm chart: `datadog/datadog 3.201.6`
- Deployment role: `clusterChecksRunner` (`DD_CLC_RUNNER_ENABLED=true`)
- Replicas: 3
- Resources: `requests.memory: 2Gi`, `limits.memory: 4Gi`
- Platform: AWS EKS 1.31, x86_64, cgroup v2
- Target: AWS RDS for **PostgreSQL 18.1**, 1 primary + 2 read replicas, IAM auth (IRSA)

### What happened

The `agent` container in the `datadog-clusterchecks` Deployment is OOMKilled by the kernel roughly **every 3–4 hours per pod**. The kills are time-correlated across the 3 replicas and follow each cluster-check rebalance.

`container.memory.working_set` rises monotonically from container startup until it hits the cgroup limit. There is no plateau.

The most informative signal: **memory keeps growing on a CCR pod even after the cluster agent rebalances the primary postgres instance check off it**. RSS does not drop when a check leaves a pod. This points to retained state (a leak) rather than an undersized memory limit.

We previously raised `limits.memory` from 3Gi to 4Gi. That only delayed the OOM; the growth slope is unchanged.

### What we expected

Steady-state RSS proportional to currently-assigned scope, with state released when a check instance is unscheduled from a runner during rebalance.
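As an illustration of the expectation only, the sketch below shows per-instance state that is dropped when an instance is unscheduled. `RunnerState` and its methods are hypothetical names invented for this example; this is not the agent's or the postgres check's actual implementation, and we have not identified where the real retention happens.

```python
class RunnerState:
    """Hypothetical per-runner cache keyed by check instance ID."""

    def __init__(self):
        self._by_instance = {}

    def schedule(self, instance_id):
        # Allocate per-instance state when the check lands on this runner.
        self._by_instance.setdefault(instance_id, {"relations": []})

    def record_relation(self, instance_id, name):
        # State grows with the scope of the assigned instance.
        self._by_instance[instance_id]["relations"].append(name)

    def unschedule(self, instance_id):
        # Drop everything held for the instance so it can be garbage-collected.
        self._by_instance.pop(instance_id, None)

    def tracked_instances(self):
        return sorted(self._by_instance)


state = RunnerState()
state.schedule("postgres:primary")
state.record_relation("postgres:primary", "public.orders")
state.unschedule("postgres:primary")
print(state.tracked_instances())
```

What we observe instead looks as if the equivalent of `unschedule` never releases the instance's state, or something outside the check instance keeps referencing it.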

### Configuration

Postgres DBM cluster check config (`postgres.yaml` mounted via the cluster agent's `confd`):

```yaml
cluster_check: true
init_config:
instances:
  - dbm: true
    host: <primary>
    port: 5432
    username: datadog
    ssl: require
    empty_default_hostname: false
    exclude_hostname: true
    aws:
      managed_authentication:
        enabled: true
    database_autodiscovery:
      enabled: true
      include:
        - <15 application databases>
    relations:
      - relation_regex: .*
    tags: [..., role:primary]

  - dbm: true
    host: <replica-1>
    # ...same as primary except no `relations` block
    tags: [..., role:replica]

  - dbm: true
    host: <replica-2>
    # ...same as primary except no `relations` block
    tags: [..., role:replica]
```

- Resulting scope on the primary: **~1,330 relations across 15 databases**
- No `max_relations`, no custom `min_collection_interval`, no `query_samples` overrides

### What we ruled out

- **Connection growth** — `pg_stat_activity` shows a stable count of `datadog` sessions per CCR pod over the leak window.
- **Query samples spike** — `query_samples` on defaults; `datadog.postgres.*` ingestion rate is steady.
- **Slowing checks** — per-instance `datadog.postgres.collection.time` is stable across the leak window.
- Known fixes: we searched the integrations-core CHANGELOG and open issues for recent `postgres` memory fixes and found none that matches this rebalance-retention signature.

### Additional info available on request

- `agent status` and `agent configcheck` from a CCR pod
- 7d `container.memory.working_set` graph (currently a sawtooth from probe restarts; previously cliffs from OOMKills)
- Full rendered `postgres.yaml`
- `tracemalloc` / Python memory-tracking snapshot taken just before the probe trips on a single pod
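For reference, this is roughly how we would compare two `tracemalloc` snapshots to find what grew between them. It is a minimal standalone sketch using only the standard library; the helper name `top_growth` is ours, the `retained` list is a stand-in for leaked state, and how the snapshots are captured inside the agent process is not shown here.

```python
import tracemalloc


def top_growth(before, after, limit=10):
    """Return the `limit` largest per-line allocation increases between snapshots."""
    stats = after.compare_to(before, "lineno")
    grown = [s for s in stats if s.size_diff > 0]
    grown.sort(key=lambda s: s.size_diff, reverse=True)
    return grown[:limit]


tracemalloc.start(25)  # keep up to 25 frames per allocation traceback
before = tracemalloc.take_snapshot()

# Stand-in for state that is retained between the two snapshots.
retained = [bytes(1024) for _ in range(10_000)]

after = tracemalloc.take_snapshot()
growth = top_growth(before, after, limit=5)
for stat in growth:
    print(stat)
```

Each printed entry shows the source line, the size and count deltas, and the totals, which is the view that should reveal whether the growth sits in per-relation state or elsewhere.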

