rabbit_msg_store: use bounded timeout for GC stop during shutdown #15498
Draft
lukebakken wants to merge 2 commits into rabbitmq:main from
Conversation
Add a test that demonstrates the current behavior: when the message store GC process is unresponsive during shutdown, the supervisor kills the msg_store process before it can write recovery files (file_summary.ets, msg_store_index.ets, clean.dot). This forces a full index rebuild on the next startup.

The test suspends the GC process with sys:suspend, then terminates the msg_store via the supervisor while a spawned process kills it after 500ms (simulating the supervisor shutdown timeout). After restart, successfully_recovered_state returns false, confirming the unclean recovery. Also add rabbit_msg_store:gc_pid/1 to expose the GC pid for testing.
When the message store shuts down, its terminate callback calls rabbit_msg_store_gc:stop/1 with an infinity timeout. If the GC process is stuck (e.g. on disk I/O during compaction under disk pressure), terminate blocks until the supervisor kills the msg_store process. This prevents the recovery files (file_summary.ets, msg_store_index.ets, clean.dot) from being written, forcing a full index rebuild on the next startup.

Add rabbit_msg_store_gc:stop/2 with a configurable timeout. In terminate, use a timeout derived from msg_store_shutdown_timeout minus a 60s margin. If the GC does not stop in time, kill it and proceed to write recovery files. This is safe because the unclean recovery path handles any inconsistency from a mid-operation GC kill, and no messages are lost.

Update the test to verify that after the fix, the msg_store recovers cleanly (successfully_recovered_state returns true) even when the GC is unresponsive during shutdown.
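The bounded stop described above could be sketched as follows. This is illustrative, not the actual patch: it assumes the GC process can be stopped via `gen_server:stop/3` (the real module uses RabbitMQ's own gen_server variant), and the fallback kill mirrors the behavior the commit message describes.

```erlang
%% Hypothetical sketch of rabbit_msg_store_gc:stop/2 with a bounded wait.
-spec stop(pid(), timeout()) -> ok.
stop(Pid, Timeout) ->
    try
        %% Wait up to Timeout for the GC's terminate callback to finish.
        gen_server:stop(Pid, normal, Timeout)
    catch
        exit:timeout ->
            %% GC did not stop in time (e.g. stuck in disk I/O during
            %% compaction): kill it so the msg_store's own terminate can
            %% still write its recovery files.
            exit(Pid, kill),
            ok
    end.
```

Compared with the old `stop/1`, the only behavioral change is that a wedged GC no longer blocks shutdown forever; a clean GC stop takes exactly the same path as before.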
Author

Leaving as draft until I can re-reproduce the issue without this fix, then verify the fix, in a "real" environment. Early reviews are welcome, of course 😸
Collaborator

@lhoguin can you please take a quick look? Thank you.
Author

No hurry, because I am still working on this PR and testing it.
Contributor

Perhaps just change
Problem

When `rabbit_msg_store` shuts down, its `terminate` callback calls `rabbit_msg_store_gc:stop/1` with an `infinity` timeout. If the GC process is stuck on disk I/O (for example, mid-compaction with disk alarms flapping near the free space limit), `terminate` blocks indefinitely. After the supervisor's shutdown timeout expires (default 600s via `msg_store_shutdown_timeout`), the supervisor kills the message store process.

Because the process is killed, `terminate` never reaches the code that writes the recovery files (`file_summary.ets`, `msg_store_index.ets`, `clean.dot`). On the next startup, the message store detects the missing recovery data and rebuilds indices from scratch by scanning all segment files on disk. For large message stores this rebuild can take a significant amount of time.

We hit this in production on a broker under PerfTest load. The persistent message store for the `/` vhost logged "Stopping message store" and then nothing for 10 minutes until the supervisor killed it with reason `killed`. The GC process must have been blocked on disk I/O while disk free space was hovering right at the 2 GiB limit. On restart, the store logged "rebuilding indices from scratch" despite the shutdown having been initiated gracefully via `rabbitmqctl stop`.

Fix
Add `rabbit_msg_store_gc:stop/2`, which accepts a timeout. In `terminate`, derive a GC timeout from `msg_store_shutdown_timeout` minus a 60-second margin (minimum 5 seconds) to leave time for writing recovery files. If the GC does not respond within that window, kill it and proceed with the remaining shutdown steps.
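The timeout derivation might look like the sketch below. The helper name `gc_stop_timeout/0` is illustrative, and it assumes `msg_store_shutdown_timeout` is read from the `rabbit` application environment in milliseconds; the margin and floor mirror the values stated above.

```erlang
%% Hypothetical derivation of the bounded GC stop timeout used in terminate.
-define(GC_STOP_MARGIN, 60000). %% leave 60s to write recovery files
-define(GC_STOP_MIN, 5000).     %% but always wait at least 5s for the GC

gc_stop_timeout() ->
    ShutdownTimeout =
        application:get_env(rabbit, msg_store_shutdown_timeout, 600000),
    max(?GC_STOP_MIN, ShutdownTimeout - ?GC_STOP_MARGIN).
```

With the 600s default this waits up to 540s for the GC, leaving a full minute for the msg_store to flush its recovery files before the supervisor's own deadline.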
Killing the GC mid-operation is safe with respect to message data:

- `compact_file` copies messages before updating the index, and the original data remains on disk until truncation. The code comments confirm: "it's OK if we crash at any point before we update the index because the old data is still there until we truncate."
- `truncate_file` only removes data that has already been compacted to earlier offsets.
- `delete_file` only deletes files with zero valid messages, enforced by assertions before the delete.

The unclean recovery path (`build_index/3`) rebuilds everything from the actual segment files on disk using `scan_file_for_valid_messages`, so any inconsistency between the file summary and the on-disk state is handled. In the common case (GC killed before it modified the file summary ETS), the recovery files will be fully consistent and the next startup will recover cleanly without a rebuild.
Commits
The first commit adds a test that reproduces the issue by suspending the GC process and then killing the message store process, demonstrating that `successfully_recovered_state` returns `false`. The second commit implements the fix and updates the test to verify clean recovery and message survival.
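The reproduction from the first commit can be sketched roughly as below. This is a simplified outline, not the actual CT case: `MsgStorePid` is assumed to be the running message store process, and `rabbit_msg_store:gc_pid/1` is the helper this PR adds.

```erlang
%% Rough sketch of the reproduction described above (not the real test).
reproduce_unclean_shutdown(MsgStorePid) ->
    %% Suspend the GC so it cannot answer the stop request in terminate.
    GCPid = rabbit_msg_store:gc_pid(MsgStorePid),
    ok = sys:suspend(GCPid),
    %% Simulate the supervisor's shutdown timeout: kill the msg_store
    %% 500ms after the graceful stop begins, before it can write
    %% file_summary.ets / msg_store_index.ets / clean.dot.
    _Killer = spawn(fun() ->
                            timer:sleep(500),
                            exit(MsgStorePid, kill)
                    end),
    %% The graceful stop blocks on the suspended GC until the kill lands.
    catch gen_server:stop(MsgStorePid, normal, infinity),
    ok.
```

After restarting the store, `successfully_recovered_state` is expected to return `false` before the fix (unclean recovery, full index rebuild) and `true` after it.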