-
Notifications
You must be signed in to change notification settings - Fork 399
Fix flaky data_streams bucket aggregation test #5093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
👋 Hey @DataDog/ruby-guild, please fill "Change log entry" section in the pull request description. If changes need to be present in CHANGELOG.md you can state it this way **Change log entry**
Yes. A brief summary to be placed into the CHANGELOG.md(possible answers Yes/Yep/Yeah) Or you can opt out like that **Change log entry**
None.(possible answers No/Nope/None) Visited at: 2025-11-25 17:18:17 UTC |
BenchmarksBenchmark execution time: 2025-11-26 18:46:04 Comparing candidate commit 77d3041 in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 44 metrics, 2 unstable metrics. |
|
👍 on the general approach but it looks like the tests in question are still intermittently failing with this PR. Would be nice to have the time fixing code extracted into a helper but I can do that either in this PR or in a follow-up PR. |
73a6044 to
1d45c21
Compare
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage 🔗 Commit SHA: 77d3041 | Docs | Datadog PR Page | Was this helpful? Give us feedback! |
@p-datadog with_frozen_time do
# ... test stuff
end
# with a time set in case thats important
with_frozen_time(Time.utc(2000, 1, 1, 0, 0, 0)) do
# ... test stuff
end |
Fixes two race conditions in 'aggregates multiple checkpoints into DDSketch histograms' test. Root Cause 1 - Timestamp Race: The test captured 'now = Time.now.to_f' at the start, but the checkpoints created immediately after used their own Core::Utils::Time.now internally. These timestamps could differ by milliseconds, causing checkpoints to be bucketed into a different time window than the test expected, resulting in nil bucket lookups. Root Cause 2 - Background Worker Race: The processor's background worker thread periodically calls perform(), which drains the event buffer via process_events() and then clears buckets via flush_stats(). This could happen before the test checks buckets, resulting in empty buckets. Solution: 1. Use Core::Utils::Time.now_provider to inject a fixed timestamp (Time.utc(2000, 1, 1, 0, 0, 0)) ensuring all checkpoints use the exact same deterministic timestamp. 2. Stop the background worker with processor.stop(true) before creating checkpoints, preventing it from interfering with the test's manual event processing and bucket inspection. 3. Add diagnostic messages showing expected vs actual bucket keys when assertions fail to aid future debugging. Related to incident #46145
1d45c21 to
1502223
Compare
Simplifies approach per PR discussion: - Remove with_frozen_time helper - Use standard RSpec stub for time - Keep aggregate_failures for better test output - Explicit frozen_time variable for clarity
y9v
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for fixing this!
| allow(Datadog::Core::Utils::Time).to receive(:now).and_return(frozen_time) | ||
| allow(Datadog::Tracing).to receive(:active_span).and_return(nil) | ||
|
|
||
| processor.stop(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused by this line - is the processor leaked from another test? If it's made for this test and not used then is it possible to not make it at all?
Fixes race condition in 'aggregates multiple checkpoints into DDSketch histograms' test.
Root Cause:
The test captured 'now = Time.now.to_f' at the start, but the checkpoints
created immediately after used their own Core::Utils::Time.now internally.
These timestamps could differ by milliseconds, causing checkpoints to be
bucketed into a different time window than the test expected.
When the test calculated the expected bucket:
bucket_time_ns = now_ns - (now_ns % processor.bucket_size_ns)
And looked it up:
bucket = processor.buckets[bucket_time_ns]
The bucket would be nil if the checkpoints used a slightly different
timestamp and landed in an adjacent bucket.
Solution:
Use Core::Utils::Time.now_provider to inject a fixed timestamp
(Time.utc(2000, 1, 1, 0, 0, 0)) for the entire test. This ensures
all checkpoints use the exact same deterministic timestamp,
eliminating the race condition completely.
The test logic remains unchanged - we're just controlling the time
source to make it deterministic rather than dependent on wall clock
timing.
Related to incident #46145
Seed that reproduced the flake: 61335
What does this PR do?
Motivation:
Change log entry
Additional Notes:
How to test the change?