Commit 04cada1

test(antithesis): enable forwarder disk persistence — flags a log-amplification bug
Sample forwarder_storage_max_size_in_bytes 50/50 on/off, with forwarder_storage_path on a persistent compose volume, so the on-disk retry queue and restart-recovery paths run for the first time.

Bug this branch surfaces: with persistence on, a network partition fills the disk-backed retry queue, and the forwarder logs at error! level once per failed retry attempt (io.rs:462/472/421). Over a large backlog that is unbounded log amplification: it floods per-moment output, tripping 'very high output ... fail to materialize' at cx=134896 on run 4ecf6d1b, which masks other findings.

The same path also opens the hunt for non-atomic torn writes at persisted.rs:184 under node termination.
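To make the amplification concrete, here is a minimal std-only sketch that counts log lines while a backlog is retried during a partition. It is not the forwarder's code: `CoalescedErrors`, `simulate`, and the window size are hypothetical names chosen for illustration, and the coalescing policy is only one possible contrast to logging on every failure.

```rust
/// Hypothetical coalescing logger: emits one line for the first failure in
/// each window of `window` failures and only counts the rest.
struct CoalescedErrors {
    window: usize,        // failures covered by each emitted line
    seen: usize,          // failures observed so far
    emitted: Vec<String>, // lines that would actually be written
}

impl CoalescedErrors {
    fn new(window: usize) -> Self {
        assert!(window > 0, "window must be nonzero");
        Self { window, seen: 0, emitted: Vec::new() }
    }

    fn record_failure(&mut self, payload_id: usize) {
        if self.seen % self.window == 0 {
            self.emitted.push(format!(
                "retry failed for payload {payload_id} (first of up to {} failures)",
                self.window
            ));
        }
        self.seen += 1;
    }
}

/// Simulates `passes` retry passes over `backlog` queued payloads while the
/// intake is unreachable, so every attempt fails.
/// Returns (lines with one log per failure, lines with coalescing).
fn simulate(backlog: usize, passes: usize, window: usize) -> (usize, usize) {
    let mut coalesced = CoalescedErrors::new(window);
    let mut per_attempt_lines = 0;
    for _ in 0..passes {
        for id in 0..backlog {
            per_attempt_lines += 1; // one error line per failed attempt
            coalesced.record_failure(id);
        }
    }
    (per_attempt_lines, coalesced.emitted.len())
}
```

With one line per failure, output grows as backlog times retry passes, which is why a large disk-backed queue can flood the log. The coalesced count grows only with the number of windows.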
1 parent dd0c580 commit 04cada1

2 files changed

Lines changed: 25 additions & 0 deletions

test/antithesis/deploy/docker-compose.yaml

Lines changed: 3 additions & 0 deletions
@@ -41,6 +41,8 @@ services:
       - dogstatsd-socket:/var/run/datadog
       # first_sample_config (workload) writes this timeline's datadog.yaml + ready sentinel here.
       - agent-config:/agent-config:ro
+      # Forwarder on-disk retry queue. Persists across node termination so restart can recover it.
+      - forwarder-storage:/var/lib/adp-storage
     depends_on:
       intake:
         condition: service_healthy
@@ -69,3 +71,4 @@ services:
 volumes:
   dogstatsd-socket:
   agent-config:
+  forwarder-storage:

test/antithesis/harness/src/bin/first_sample_config/config.rs

Lines changed: 22 additions & 0 deletions
@@ -175,6 +175,16 @@ fn sample_buffer_size<R: Rng + ?Sized>(rng: &mut R) -> u64 {
     }
 }
 
+/// Forwarder on-disk retry cap. Half the time a real size so disk persistence is
+/// on and the persisted-retry path runs, half the time 0 for in-memory-only.
+fn sample_storage_max_bytes<R: Rng + ?Sized>(rng: &mut R) -> u64 {
+    if rng.random_ratio(1, 2) {
+        0
+    } else {
+        rng.random_range(1_048_576..=268_435_456)
+    }
+}
+
 impl DogStatsdConfig {
     /// Sample the `DogStatsD` options from `rng`, taking the socket from the
     /// environment.
@@ -229,6 +239,16 @@ pub(crate) struct DatadogConfig {
     /// with [`Probe`] so it often lands small enough for the workload to reach
     /// and exercise the cap, and occasionally large to probe the headroom.
     aggregate_context_limit: u64,
+    /// Forwarder on-disk retry cap, `forwarder_storage_max_size_in_bytes`. ADP
+    /// defaults to 0, which disables disk persistence and leaves the persisted
+    /// retry path dead. Sampled half the time nonzero to turn persistence on,
+    /// half the time 0 to cover the in-memory-only path.
+    #[serde(rename = "forwarder_storage_max_size_in_bytes")]
+    forwarder_storage_max_size_bytes: u64,
+    /// Forwarder storage directory, `forwarder_storage_path`. A mounted volume
+    /// that survives node termination, so a restart can recover the queue.
+    #[serde(rename = "forwarder_storage_path")]
+    forwarder_storage_path: &'static str,
     /// `DogStatsD` options, flattened to top-level `dogstatsd_*` keys.
     #[serde(flatten)]
     dogstatsd: DogStatsdConfig,
@@ -247,6 +267,8 @@ impl DatadogConfig {
             dd_url: dd_url.to_owned(),
             log_level: rng.random(),
             aggregate_context_limit: Probe.sample(rng),
+            forwarder_storage_max_size_bytes: sample_storage_max_bytes(rng),
+            forwarder_storage_path: "/var/lib/adp-storage",
             dogstatsd: DogStatsdConfig::sample(rng, dogstatsd_socket),
         }
     }
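
For orientation, the two new fields should land in the generated datadog.yaml roughly as the keys named in the `#[serde(rename = ...)]` attributes. The sketch below is a hand-rolled, std-only stand-in, not the harness's serde output: `render_forwarder_keys`, `persistence_enabled`, and the two constants are hypothetical helpers introduced only to show the key names, the sampling bounds (1 MiB to 256 MiB), and the "0 disables persistence" rule stated in the doc comments.

```rust
/// Bounds of the nonzero sampling range used in the diff.
const MIN_STORAGE_BYTES: u64 = 1 << 20; // 1_048_576 bytes, 1 MiB
const MAX_STORAGE_BYTES: u64 = 1 << 28; // 268_435_456 bytes, 256 MiB

/// Per the doc comment in the diff, a cap of 0 leaves disk persistence off.
fn persistence_enabled(max_bytes: u64) -> bool {
    max_bytes != 0
}

/// Hypothetical renderer for the two forwarder keys. The real harness
/// serializes the whole config struct with serde; this only mirrors the
/// key names that the rename attributes produce.
fn render_forwarder_keys(max_bytes: u64, path: &str) -> String {
    format!(
        "forwarder_storage_max_size_in_bytes: {max_bytes}\nforwarder_storage_path: {path}\n"
    )
}
```

Note that the storage path is written on every timeline, including those where the cap is 0, so the in-memory-only runs differ from the persistent ones only in the cap value.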
