
rabbit_msg_store: use bounded timeout for GC stop during shutdown#15498

Draft
lukebakken wants to merge 2 commits into rabbitmq:main from amazon-mq:fix/msg-store-gc-stop-timeout

Conversation

@lukebakken
Collaborator

Problem

When rabbit_msg_store shuts down, its terminate callback calls rabbit_msg_store_gc:stop/1 with an infinity timeout. If the GC process is stuck on disk I/O (for example, mid-compaction with disk alarms flapping near the free space limit), terminate blocks indefinitely. After the supervisor's shutdown timeout expires (default 600s via msg_store_shutdown_timeout), the supervisor kills the message store process.

Because the process is killed, terminate never reaches the code that writes the recovery files (file_summary.ets, msg_store_index.ets, clean.dot). On the next startup, the message store detects the missing recovery data and rebuilds indices from scratch by scanning all segment files on disk. For large message stores this rebuild can take a significant amount of time.
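The blocking pattern can be demonstrated in isolation. The standalone module below (a sketch, not RabbitMQ code) simulates a GC wedged on disk I/O with a process that never replies: a bounded gen_server:call regains control after the timeout, whereas an infinity call would hang exactly as terminate does here.

```erlang
-module(gc_stop_demo).
-export([run/0]).

%% Simulate a "GC" process that is alive but never answers requests,
%% the way a process blocked in disk I/O behaves from the caller's
%% point of view.
run() ->
    Stuck = spawn(fun() -> receive never -> ok end end),
    %% With a bounded timeout the caller gets control back; with
    %% infinity (as in the current terminate) it would block until
    %% the supervisor kills the whole message store process.
    try gen_server:call(Stuck, stop, 500) of
        _ -> unexpected
    catch
        exit:{timeout, _} -> timed_out
    end.
```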

We hit this in production on a broker under PerfTest load. The persistent message store for the / vhost logged "Stopping message store" and then nothing for 10 minutes until the supervisor killed it with reason killed. The GC process must have been blocked on disk I/O while disk free space was hovering right at the 2 GiB limit. On restart, the store logged "rebuilding indices from scratch" despite the shutdown having been initiated gracefully via rabbitmqctl stop.

Fix

Add rabbit_msg_store_gc:stop/2, which accepts a timeout. In terminate, derive a GC timeout from msg_store_shutdown_timeout minus a 60-second margin (with a 5-second minimum) to leave time for writing the recovery files. If the GC does not respond within that window, kill it and proceed with the remaining shutdown steps.
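The timeout derivation and the kill fallback described above could be sketched as follows (function and macro names here are illustrative, not the exact PR code):

```erlang
-module(gc_timeout_demo).
-export([gc_stop_timeout/1, stop_gc/2]).

-define(MARGIN, 60000).       %% leave 60s for writing recovery files
-define(MIN_TIMEOUT, 5000).   %% never wait less than 5s for the GC

%% Derive the GC stop timeout from msg_store_shutdown_timeout.
gc_stop_timeout(infinity) ->
    infinity;
gc_stop_timeout(ShutdownTimeout) when is_integer(ShutdownTimeout) ->
    max(?MIN_TIMEOUT, ShutdownTimeout - ?MARGIN).

%% Ask the GC to stop within Timeout; on timeout, kill it and carry
%% on so terminate can still write the recovery files.
stop_gc(GCPid, Timeout) ->
    try gen_server:call(GCPid, stop, Timeout)
    catch exit:{timeout, _} ->
        exit(GCPid, kill),
        ok
    end.
```

With the default msg_store_shutdown_timeout of 600s, gc_stop_timeout(600000) yields 540000 ms, leaving the stated 60-second margin.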

Killing the GC mid-operation is safe with respect to message data:

  • compact_file copies messages before updating the index, and the original data remains on disk until truncation. The code comments confirm: "it's OK if we crash at any point before we update the index because the old data is still there until we truncate."
  • truncate_file only removes data that has already been compacted to earlier offsets.
  • delete_file only deletes files with zero valid messages, enforced by assertions before the delete.
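The copy-before-truncate ordering that makes the first point safe can be shown with a toy example. The file layout below is invented for the demo and has nothing to do with rabbit_msg_store's actual segment format; it only illustrates that the original bytes survive until the final truncate, so a kill at any earlier step loses nothing:

```erlang
-module(compact_order_demo).
-export([run/0]).

run() ->
    File = "compact_demo.bin",
    %% A 4-byte hole followed by one 4-byte "message" at offset 4.
    ok = file:write_file(File, <<"HOLE", "MSG1">>),
    {ok, Fd} = file:open(File, [read, write, binary]),
    %% Step 1: copy the valid message to its new offset. The old copy
    %% at offset 4 is still on disk, so a crash here is harmless --
    %% the index still points at the old, intact data.
    {ok, <<"MSG1">>} = file:pread(Fd, 4, 4),
    ok = file:pwrite(Fd, 0, <<"MSG1">>),
    %% Step 2: only after the index is updated to offset 0 is the
    %% tail truncated away.
    {ok, 4} = file:position(Fd, 4),
    ok = file:truncate(Fd),
    %% The message is still readable at its new offset.
    {ok, <<"MSG1">>} = file:pread(Fd, 0, 4),
    ok = file:close(Fd),
    ok = file:delete(File),
    ok.
```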

The unclean recovery path (build_index/3) rebuilds everything from the actual segment files on disk using scan_file_for_valid_messages, so any inconsistency between the file summary and the on-disk state is handled. In the common case (GC killed before it modified the file summary ETS), the recovery files will be fully consistent and the next startup will recover cleanly without a rebuild.

Commits

The first commit adds a test that reproduces the issue by suspending the GC process and then killing the message store process, demonstrating that successfully_recovered_state returns false. The second commit implements the fix and updates the test to verify clean recovery and message survival.

Add a test that demonstrates the current behavior: when the message
store GC process is unresponsive during shutdown, the supervisor
kills the msg_store process before it can write recovery files
(file_summary.ets, msg_store_index.ets, clean.dot). This forces a
full index rebuild on the next startup.

The test suspends the GC process with sys:suspend, then terminates
the msg_store via the supervisor while a spawned process kills it
after 500ms (simulating the supervisor shutdown timeout). After
restart, successfully_recovered_state returns false, confirming the
unclean recovery.

Also add rabbit_msg_store:gc_pid/1 to expose the GC pid for testing.
When the message store shuts down, its terminate callback calls
rabbit_msg_store_gc:stop/1 with an infinity timeout. If the GC
process is stuck (e.g. on disk I/O during compaction under disk
pressure), terminate blocks until the supervisor kills the msg_store
process. This prevents the recovery files (file_summary.ets,
msg_store_index.ets, clean.dot) from being written, forcing a full
index rebuild on the next startup.

Add rabbit_msg_store_gc:stop/2 with a configurable timeout. In
terminate, use a timeout derived from msg_store_shutdown_timeout
minus a 60s margin. If the GC does not stop in time, kill it and
proceed to write recovery files. This is safe because the unclean
recovery path handles any inconsistency from a mid-operation GC
kill, and no messages are lost.

Update the test to verify that after the fix, the msg_store recovers
cleanly (successfully_recovered_state returns true) even when the GC
is unresponsive during shutdown.
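The sys:suspend mechanism the test relies on can be demonstrated standalone: a suspended gen_server processes no requests, so a bounded call to it times out, which is exactly how the test makes the GC unresponsive. The module below is a self-contained demo, not the actual test code:

```erlang
-module(suspend_demo).
-behaviour(gen_server).
-export([start/0, init/1, handle_call/3, handle_cast/2]).

start() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    ok = sys:suspend(Pid),
    %% While suspended, the server handles no calls, so this bounded
    %% call times out -- just as the message store's stop call hangs
    %% on the suspended GC in the test.
    Result = try gen_server:call(Pid, ping, 200)
             catch exit:{timeout, _} -> timed_out
             end,
    ok = sys:resume(Pid),
    %% After resume, the queued request semantics are back to normal.
    pong = gen_server:call(Pid, ping, 200),
    ok = gen_server:stop(Pid),
    Result.

init([]) -> {ok, #{}}.
handle_call(ping, _From, State) -> {reply, pong, State}.
handle_cast(_Msg, State) -> {noreply, State}.
```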
@lukebakken
Collaborator Author

Leaving as draft until I can re-reproduce the issue without this fix, then verify this fix, in a "real" environment. Early reviews are welcome, of course 😸

@michaelklishin
Collaborator

@lhoguin can you please take a quick look? Thank you.

@lukebakken
Collaborator Author

No hurry because I am still working on this PR and testing it.

@lhoguin
Contributor

lhoguin commented Feb 23, 2026

Perhaps just change rabbit_msg_store_gc to do exit(GcPid, shutdown) instead of sending a message. It'll stop faster for everyone.
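One way to read this suggestion (a sketch, not the PR's or RabbitMQ's code): send the exit signal and wait on a monitor, escalating to kill if the GC traps exits and still overruns the timeout. A non-trapping GC dies immediately on the shutdown signal.

```erlang
-module(exit_stop_demo).
-export([stop_gc_via_exit/2]).

%% Stop a process with an exit signal rather than a synchronous stop
%% message. If it traps exits it gets a chance to run terminate; if
%% it does not finish within Timeout, escalate to a brutal kill.
stop_gc_via_exit(GCPid, Timeout) ->
    Ref = erlang:monitor(process, GCPid),
    exit(GCPid, shutdown),
    receive
        {'DOWN', Ref, process, GCPid, _Reason} -> ok
    after Timeout ->
        exit(GCPid, kill),
        receive {'DOWN', Ref, process, GCPid, _} -> ok end
    end.
```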
