Skip to content

Conversation

@ericfirth
Copy link
Contributor

Fixes race condition in 'aggregates multiple checkpoints into DDSketch histograms' test.

Root Cause:
The test captured 'now = Time.now.to_f' at the start, but the checkpoints
created immediately after used their own Core::Utils::Time.now internally.
These timestamps could differ by milliseconds, causing checkpoints to be
bucketed into a different time window than the test expected.

When the test calculated the expected bucket:
bucket_time_ns = now_ns - (now_ns % processor.bucket_size_ns)

And looked it up:
bucket = processor.buckets[bucket_time_ns]

The bucket would be nil if the checkpoints used a slightly different
timestamp and landed in an adjacent bucket.

Solution:
Use Core::Utils::Time.now_provider to inject a fixed timestamp
(Time.utc(2000, 1, 1, 0, 0, 0)) for the entire test. This ensures
all checkpoints use the exact same deterministic timestamp,
eliminating the race condition completely.

The test logic remains unchanged - we're just controlling the time
source to make it deterministic rather than dependent on wall clock
timing.

Related to incident #46145
Seed that reproduced the flake: 61335

What does this PR do?

Motivation:

Change log entry

Additional Notes:

How to test the change?

@ericfirth ericfirth requested a review from a team as a code owner November 25, 2025 17:18
@github-actions
Copy link

👋 Hey @DataDog/ruby-guild, please fill "Change log entry" section in the pull request description.

If changes need to be present in CHANGELOG.md you can state it this way

**Change log entry**

Yes. A brief summary to be placed into the CHANGELOG.md

(possible answers Yes/Yep/Yeah)

Or you can opt out like that

**Change log entry**

None.

(possible answers No/Nope/None)

Visited at: 2025-11-25 17:18:17 UTC

@github-actions github-actions bot added the dev/testing Involves testing processes (e.g. RSpec) label Nov 25, 2025
@pr-commenter
Copy link

pr-commenter bot commented Nov 25, 2025

Benchmarks

Benchmark execution time: 2025-11-26 18:46:04

Comparing candidate commit 77d3041 in PR branch eric.firth/fix-ddsketch-flake with baseline commit f045f9d in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 44 metrics, 2 unstable metrics.

@p-datadog
Copy link
Member

👍 on the general approach but it looks like the tests in question are still intermittently failing with this PR.

Would be nice to have the time fixing code extracted into a helper but I can do that either in this PR or in a follow-up PR.

@ericfirth ericfirth force-pushed the eric.firth/fix-ddsketch-flake branch from 73a6044 to 1d45c21 Compare November 26, 2025 14:33
@datadog-datadog-prod-us1
Copy link
Contributor

datadog-datadog-prod-us1 bot commented Nov 26, 2025

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage
Patch Coverage: 88.89%
Total Coverage: 95.16% (-0.00%)

View detailed report

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 77d3041 | Docs | Datadog PR Page | Was this helpful? Give us feedback!

@ericfirth
Copy link
Contributor Author

ericfirth commented Nov 26, 2025

👍 on the general approach but it looks like the tests in question are still intermittently failing with this PR.

Would be nice to have the time fixing code extracted into a helper but I can do that either in this PR or in a follow-up PR.

@p-datadog
Perhaps something like

with_frozen_time do
  # ... test stuff
end

# with a time set in case thats important
with_frozen_time(Time.utc(2000, 1, 1, 0, 0, 0)) do
   # ... test stuff
end

Fixes two race conditions in 'aggregates multiple checkpoints into DDSketch histograms' test.

Root Cause 1 - Timestamp Race:
The test captured 'now = Time.now.to_f' at the start, but the checkpoints
created immediately after used their own Core::Utils::Time.now internally.
These timestamps could differ by milliseconds, causing checkpoints to be
bucketed into a different time window than the test expected, resulting in
nil bucket lookups.

Root Cause 2 - Background Worker Race:
The processor's background worker thread periodically calls perform(), which
drains the event buffer via process_events() and then clears buckets via
flush_stats(). This could happen before the test checks buckets, resulting
in empty buckets.

Solution:
1. Use Core::Utils::Time.now_provider to inject a fixed timestamp
   (Time.utc(2000, 1, 1, 0, 0, 0)) ensuring all checkpoints use the
   exact same deterministic timestamp.

2. Stop the background worker with processor.stop(true) before creating
   checkpoints, preventing it from interfering with the test's manual
   event processing and bucket inspection.

3. Add diagnostic messages showing expected vs actual bucket keys when
   assertions fail to aid future debugging.

Related to incident #46145
@ericfirth ericfirth force-pushed the eric.firth/fix-ddsketch-flake branch from 1d45c21 to 1502223 Compare November 26, 2025 15:23
@ericfirth ericfirth requested a review from p-datadog November 26, 2025 16:16
Simplifies approach per PR discussion:
- Remove with_frozen_time helper
- Use standard RSpec stub for time
- Keep aggregate_failures for better test output
- Explicit frozen_time variable for clarity
@ericfirth ericfirth requested review from Strech and y9v November 26, 2025 18:18
Copy link
Member

@y9v y9v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for fixing this!

allow(Datadog::Core::Utils::Time).to receive(:now).and_return(frozen_time)
allow(Datadog::Tracing).to receive(:active_span).and_return(nil)

processor.stop(true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused by this line - is the processor leaked from another test? If it's made for this test and not used then is it possible to not make it at all?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dev/testing Involves testing processes (e.g. RSpec)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants