
rabbit_msg_store: use bounded timeout for GC stop during shutdown#15498

Draft
lukebakken wants to merge 2 commits into rabbitmq:main from amazon-mq:fix/msg-store-gc-stop-timeout

Conversation

@lukebakken
Collaborator

Problem

When rabbit_msg_store shuts down, its terminate callback calls rabbit_msg_store_gc:stop/1 with an infinity timeout. If the GC process is stuck on disk I/O (for example, mid-compaction with disk alarms flapping near the free space limit), terminate blocks indefinitely. After the supervisor's shutdown timeout expires (default 600s via msg_store_shutdown_timeout), the supervisor kills the message store process.

Because the process is killed, terminate never reaches the code that writes the recovery files (file_summary.ets, msg_store_index.ets, clean.dot). On the next startup, the message store detects the missing recovery data and rebuilds indices from scratch by scanning all segment files on disk. For large message stores this rebuild can take a significant amount of time.
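The blocking pattern can be demonstrated in isolation. The standalone module below (a sketch, not RabbitMQ code) simulates a GC wedged on disk I/O with a process that never replies: a bounded gen_server:call regains control after the timeout, whereas an infinity call would hang exactly as terminate does here.

```erlang
-module(gc_stop_demo).
-export([run/0]).

%% Simulate a "GC" process that is alive but never answers requests,
%% the way a process blocked in disk I/O behaves from the caller's
%% point of view.
run() ->
    Stuck = spawn(fun() -> receive never -> ok end end),
    %% With a bounded timeout the caller gets control back; with
    %% infinity (as in the current terminate) it would block until
    %% the supervisor kills the whole message store process.
    try gen_server:call(Stuck, stop, 500) of
        _ -> unexpected
    catch
        exit:{timeout, _} -> timed_out
    end.
```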

We hit this in production on a broker under PerfTest load. The persistent message store for the / vhost logged "Stopping message store" and then nothing for 10 minutes until the supervisor killed it with reason killed. The GC process must have been blocked on disk I/O while disk free space was hovering right at the 2 GiB limit. On restart, the store logged "rebuilding indices from scratch" despite the shutdown having been initiated gracefully via rabbitmqctl stop.

Fix

Add rabbit_msg_store_gc:stop/2, which accepts a timeout. In terminate, derive a GC timeout from msg_store_shutdown_timeout minus a 60-second margin (with a 5-second minimum) to leave time for writing the recovery files. If the GC does not respond within that window, kill it and proceed with the remaining shutdown steps.
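The timeout derivation and the kill fallback described above could be sketched as follows (function and macro names here are illustrative, not the exact PR code):

```erlang
-module(gc_timeout_demo).
-export([gc_stop_timeout/1, stop_gc/2]).

-define(MARGIN, 60000).       %% leave 60s for writing recovery files
-define(MIN_TIMEOUT, 5000).   %% never wait less than 5s for the GC

%% Derive the GC stop timeout from msg_store_shutdown_timeout.
gc_stop_timeout(infinity) ->
    infinity;
gc_stop_timeout(ShutdownTimeout) when is_integer(ShutdownTimeout) ->
    max(?MIN_TIMEOUT, ShutdownTimeout - ?MARGIN).

%% Ask the GC to stop within Timeout; on timeout, kill it and carry
%% on so terminate can still write the recovery files.
stop_gc(GCPid, Timeout) ->
    try gen_server:call(GCPid, stop, Timeout)
    catch exit:{timeout, _} ->
        exit(GCPid, kill),
        ok
    end.
```

With the default msg_store_shutdown_timeout of 600s, gc_stop_timeout(600000) yields 540000 ms, leaving the stated 60-second margin.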

Killing the GC mid-operation is safe with respect to message data:

  • compact_file copies messages before updating the index, and the original data remains on disk until truncation. The code comments confirm: "it's OK if we crash at any point before we update the index because the old data is still there until we truncate."
  • truncate_file only removes data that has already been compacted to earlier offsets.
  • delete_file only deletes files with zero valid messages, enforced by assertions before the delete.
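The copy-before-truncate ordering that makes the first point safe can be shown with a toy example. The file layout below is invented for the demo and has nothing to do with rabbit_msg_store's actual segment format; it only illustrates that the original bytes survive until the final truncate, so a kill at any earlier step loses nothing:

```erlang
-module(compact_order_demo).
-export([run/0]).

run() ->
    File = "compact_demo.bin",
    %% A 4-byte hole followed by one 4-byte "message" at offset 4.
    ok = file:write_file(File, <<"HOLE", "MSG1">>),
    {ok, Fd} = file:open(File, [read, write, binary]),
    %% Step 1: copy the valid message to its new offset. The old copy
    %% at offset 4 is still on disk, so a crash here is harmless --
    %% the index still points at the old, intact data.
    {ok, <<"MSG1">>} = file:pread(Fd, 4, 4),
    ok = file:pwrite(Fd, 0, <<"MSG1">>),
    %% Step 2: only after the index is updated to offset 0 is the
    %% tail truncated away.
    {ok, 4} = file:position(Fd, 4),
    ok = file:truncate(Fd),
    %% The message is still readable at its new offset.
    {ok, <<"MSG1">>} = file:pread(Fd, 0, 4),
    ok = file:close(Fd),
    ok = file:delete(File),
    ok.
```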

The unclean recovery path (build_index/3) rebuilds everything from the actual segment files on disk using scan_file_for_valid_messages, so any inconsistency between the file summary and the on-disk state is handled. In the common case (GC killed before it modified the file summary ETS), the recovery files will be fully consistent and the next startup will recover cleanly without a rebuild.

Commits

The first commit adds a test that reproduces the issue by suspending the GC process and then killing the message store process, demonstrating that successfully_recovered_state returns false. The second commit implements the fix and updates the test to verify clean recovery and message survival.

Add a test that demonstrates the current behavior: when the message
store GC process is unresponsive during shutdown, the supervisor
kills the msg_store process before it can write recovery files
(file_summary.ets, msg_store_index.ets, clean.dot). This forces a
full index rebuild on the next startup.

The test suspends the GC process with sys:suspend, then terminates
the msg_store via the supervisor while a spawned process kills it
after 500ms (simulating the supervisor shutdown timeout). After
restart, successfully_recovered_state returns false, confirming the
unclean recovery.

Also add rabbit_msg_store:gc_pid/1 to expose the GC pid for testing.
When the message store shuts down, its terminate callback calls
rabbit_msg_store_gc:stop/1 with an infinity timeout. If the GC
process is stuck (e.g. on disk I/O during compaction under disk
pressure), terminate blocks until the supervisor kills the msg_store
process. This prevents the recovery files (file_summary.ets,
msg_store_index.ets, clean.dot) from being written, forcing a full
index rebuild on the next startup.

Add rabbit_msg_store_gc:stop/2 with a configurable timeout. In
terminate, use a timeout derived from msg_store_shutdown_timeout
minus a 60s margin. If the GC does not stop in time, kill it and
proceed to write recovery files. This is safe because the unclean
recovery path handles any inconsistency from a mid-operation GC
kill, and no messages are lost.

Update the test to verify that after the fix, the msg_store recovers
cleanly (successfully_recovered_state returns true) even when the GC
is unresponsive during shutdown.
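The sys:suspend mechanism the test relies on can be demonstrated standalone: a suspended gen_server processes no requests, so a bounded call to it times out, which is exactly how the test makes the GC unresponsive. The module below is a self-contained demo, not the actual test code:

```erlang
-module(suspend_demo).
-behaviour(gen_server).
-export([start/0, init/1, handle_call/3, handle_cast/2]).

start() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    ok = sys:suspend(Pid),
    %% While suspended, the server handles no calls, so this bounded
    %% call times out -- just as the message store's stop call hangs
    %% on the suspended GC in the test.
    Result = try gen_server:call(Pid, ping, 200)
             catch exit:{timeout, _} -> timed_out
             end,
    ok = sys:resume(Pid),
    %% After resume, the queued request semantics are back to normal.
    pong = gen_server:call(Pid, ping, 200),
    ok = gen_server:stop(Pid),
    Result.

init([]) -> {ok, #{}}.
handle_call(ping, _From, State) -> {reply, pong, State}.
handle_cast(_Msg, State) -> {noreply, State}.
```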
@lukebakken
Collaborator Author

Leaving as draft until I can re-reproduce the issue without this fix, then verify this fix, in a "real" environment. Early reviews are welcome, of course 😸

@michaelklishin
Collaborator

@lhoguin can you please take a quick look? Thank you.

@lukebakken
Collaborator Author

No hurry because I am still working on this PR and testing it.

@lhoguin
Contributor

lhoguin commented Feb 23, 2026

Perhaps just change rabbit_msg_store_gc to do exit(GcPid, shutdown) instead of sending a message. It'll stop faster for everyone.
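One way to read this suggestion (a sketch, not the PR's or RabbitMQ's code): send the exit signal and wait on a monitor, escalating to kill if the GC traps exits and still overruns the timeout. A non-trapping GC dies immediately on the shutdown signal.

```erlang
-module(exit_stop_demo).
-export([stop_gc_via_exit/2]).

%% Stop a process with an exit signal rather than a synchronous stop
%% message. If it traps exits it gets a chance to run terminate; if
%% it does not finish within Timeout, escalate to a brutal kill.
stop_gc_via_exit(GCPid, Timeout) ->
    Ref = erlang:monitor(process, GCPid),
    exit(GCPid, shutdown),
    receive
        {'DOWN', Ref, process, GCPid, _Reason} -> ok
    after Timeout ->
        exit(GCPid, kill),
        receive {'DOWN', Ref, process, GCPid, _} -> ok end
    end.
```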
