Skip to content

Commit 246d8cd

Browse files
committed
[PROF-10128] Workaround heap profiling flaky spec on Ruby 3.3+
**What does this PR do?** This PR is a second attempt at a workaround for this flaky spec: * https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/15742/workflows/9da2d5f2-3a15-4659-9643-219f52b56b5d/jobs/572636 * https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/15799/workflows/983ab3f8-2b83-4fd7-b0bc-43f257b924c3/jobs/574214 * https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/15817/workflows/e67525cd-b5d3-435e-9eb2-ee345a8a344d/jobs/574731 ``` Failures: 1) Datadog::Profiling::StackRecorder#stats with heap profiling enabled includes heap recorder snapshot Failure/Error: expect(stack_recorder.stats).to match( hash_including( :heap_recorder_snapshot => hash_including( # Records for dead objects should have gone away :num_object_records => live_heap_samples + age0_heap_samples, # We allocate from 3 different locations in this test but only 2 # of them are for objects which should be alive at serialization time :num_heap_records => 2, # The update done during serialization should reflect the expected {:heap_recorder_snapshot=>{:last_update_objects_alive=>7, :last_update_objects_dead=>9, :last_update_...ation_time_ns_max=>221980, :serialization_time_ns_min=>221980, :serialization_time_ns_total=>221980} to match #<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e250 @expected={:heap_recorder_snapshot=>#<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e318 @expected={:num_object_records=>9, :num_heap_records=>2, :last_update_objects_alive=>6, :last_update_objects_dead=>10, :last_update_objects_skipped=>3, :last_update_objects_frozen=>3}>}> Diff: @@ -1,12 +1,8 @@ -#<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e250 - @expected= - {:heap_recorder_snapshot=> - #<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e318 - @expected= - {:num_object_records=>9, - :num_heap_records=>2, - :last_update_objects_alive=>6, - :last_update_objects_dead=>10, - :last_update_objects_skipped=>3, - :last_update_objects_frozen=>3}>}> +:heap_recorder_snapshot => {:last_update_objects_alive=>7, :last_update_objects_dead=>9, :last_update_objects_frozen=>3, :last_update_objects_skipped=>3, :num_heap_records=>3, :num_object_records=>10}, +:serialization_failures => 0, +:serialization_successes => 1, +:serialization_time_ns_avg => 221980.0, +:serialization_time_ns_max => 221980, +:serialization_time_ns_min => 221980, +:serialization_time_ns_total => 221980, # ./spec/datadog/profiling/stack_recorder_spec.rb:1044:in `block (4 levels) in <top (required)>' # ./spec/spec_helper.rb:230:in `block (2 levels) in <top (required)>' # ./spec/spec_helper.rb:115:in `block (2 levels) in <top (required)>' # /usr/local/bundle/gems/webmock-3.23.1/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>' # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block (2 levels) in <top (required)>' Finished in 24.11 seconds (files took 1.38 seconds to load) 738 examples, 1 failure, 20 pending Failed examples: rspec ./spec/datadog/profiling/stack_recorder_spec.rb:1014 # Datadog::Profiling::StackRecorder#stats with heap profiling enabled includes heap recorder snapshot Randomized with seed 48141 ``` This flaky spec is quite weird: this test creates a number of throwaway objects, and then calls `GC.start` expecting Ruby to garbage collect them. Because these objects are garbage collected, our heap profiler then also marks the objects as dead and cleans up its own internal tracking. So, what's keeping these objects alive? Apparently, nothing. I followed the same steps from https://bugs.ruby-lang.org/issues/19460#note-4 and there were no references active to the object! Specifically I added a `ObjectSpace.dump_all(output: File.open("dump-#{Process.pid}.json", "w+"))` AFTER the call to `GC.start`, and the object is still on the heap, even though there are no references to it. Thus our working theory is that this is similar to the Ruby discussion above: the Ruby GC is conservative, not precise, so somewhere there's some leftover piece of state (variable? register? that's causing Ruby to decide not to clean the object, even though there don't seem to be any references to it. As an additional datapoint, this issue seems to only happen on Ruby 3.3 in CI. Locally, I also could only reproduce it on Ruby 3.3 (I tried 3.3.4) and 3.4.0-preview1. Interestingly enough, I have debug builds of Ruby that are built with lower optimization and a lot of debug info and neither my 3.3 nor my 3.4 debug builds reproduced this issue, which perhaps suggests that some compiler optimization is leaving a reference to the object that hasn't been cleaned yet. For future reference, I can reproduce this using `rspec spec/datadog/profiling --seed 48141 --fail-fast` on top of `2fc03d5ae5860d4e9a75ce3825fba95ed288a1a2` (with the workaround removed). **Motivation:** Our usual goal of zero flaky profiler specs always applies. Fixes DataDog/ruby-guild#178 **Additional Notes:** I chose to create a hash instead of an object as this shows up clearly in the debug output -- e.g. you can immediately tell which is the object that wasn't supposed to still be alive. **How to test the change?** Verify that CI still passes.
1 parent 1402a32 commit 246d8cd

File tree

1 file changed

+15
-16
lines changed

1 file changed

+15
-16
lines changed

spec/datadog/profiling/stack_recorder_spec.rb

Lines changed: 15 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -1011,31 +1011,30 @@ def sample_allocation(obj)
10111011

10121012
it "includes heap recorder snapshot" do
10131013
live_objects = []
1014+
1015+
# NOTE: We've seen some flakiness in this spec on Ruby 3.3 when the `dead_heap_samples` were allocated after
1016+
# the `live_heap_samples`. Our working theory is that this is something like
1017+
# https://bugs.ruby-lang.org/issues/19460 biting us again (e.g. search for this URL on this file and you'll see
1018+
# a similar comment). Specifically, it looks like in some situations Ruby still keeps a reference to the last
1019+
# allocated object _somewhere_, which makes the GC not collect that object, even though there are no actual
1020+
# references to it. Because the GC doesn't clean the object, the heap recorder still keeps its record alive,
1021+
# and so the test causes flakiness.
1022+
# See also the discussion on commit 2fc03d5ae5860d4e9a75ce3825fba95ed288a1 for an earlier attempt at fixing this.
1023+
dead_heap_samples = 10
1024+
dead_heap_samples.times do |_i|
1025+
obj = {}
1026+
sample_allocation(obj)
1027+
end
1028+
10141029
live_heap_samples = 6
10151030
live_heap_samples.times do |i|
10161031
obj = Object.new
10171032
obj.freeze if i.odd? # Freeze every other object
10181033
sample_allocation(obj)
10191034
live_objects << obj
10201035
end
1021-
dead_heap_samples = 10
1022-
dead_heap_samples.times do |_i|
1023-
obj = Object.new
1024-
sample_allocation(obj)
1025-
end
10261036
GC.start # All dead objects above will be GCed, all living strings will have age = 0
10271037

1028-
# @ivoanjo: For some weird reason, the last object sampled in the "dead_heap_samples" does not always
1029-
# get collected the first time, leading to a flaky spec.
1030-
# I was able to reproduce it with `rspec spec/datadog/profiling --seed 48141 --fail-fast` and it's
1031-
# kinda bizarre since e.g. if you add one more `Object.new` it also stops flaking, so is it perhaps
1032-
# related to Ruby's conservative GC?
1033-
# I've bisected this and undoing 3d4b7fcf30b529b191ca737ae13629eb27b8ab63 also makes the flakiness
1034-
# go away, but again, that change doesn't seem to have anything to do with GC.
1035-
# As the weird behavior is transitory, e.g. it provably goes away on the next GC, I'll go with this
1036-
# workaround for now.
1037-
GC.start
1038-
10391038
begin
10401039
# Allocate some extra objects in a block with GC disabled and ask for a serialization
10411040
# to ensure these strings have age=0 by the time we try to serialize the profile

0 commit comments

Comments
 (0)