rabbit_quorum_queue: Shrink batches of QQs in parallel #15081
the-mikedavis wants to merge 1 commit into main
With this change and the default […]

This looks fine to me, at least for now. It would be quite possible to get much higher throughput on this by using command pipelining instead of spawning a bunch of processes just to exercise the WAL more. We'd need to add that as an option to the Ra API, however.

Ah yeah, with pipelining we could use the WAL much more efficiently. That shouldn't be too bad to add to Ra - just a new function in […]. I'm actually more worried about the […]. In the meantime, making this parallel seems like an easy improvement since we can continue using the […]
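To make the trade-off in this exchange concrete, here is a toy sketch in plain Erlang. It is not Ra's API: the module `pipeline_sketch` and the functions `start_server/0`, `sync_each/2` and `pipelined/2` are hypothetical names, and the "server" is just a process that acknowledges each command. It only illustrates the difference between waiting for each reply before sending the next command and sending every command first, then collecting the replies.

```erlang
-module(pipeline_sketch).
-export([start_server/0, sync_each/2, pipelined/2]).

%% Toy stand-in for a log: acknowledges every command it receives.
%% (Assumption for illustration only; a real Ra server does far more.)
start_server() ->
    spawn(fun loop/0).

loop() ->
    receive
        {From, Ref, Cmd} ->
            From ! {Ref, {applied, Cmd}},
            loop();
        stop ->
            ok
    end.

%% Synchronous style: one command in flight at a time.
sync_each(Server, Cmds) ->
    [begin
         Ref = make_ref(),
         Server ! {self(), Ref, Cmd},
         receive {Ref, Reply} -> Reply end
     end || Cmd <- Cmds].

%% Pipelined style: send all commands, then collect replies by reference.
pipelined(Server, Cmds) ->
    Refs = [begin
                Ref = make_ref(),
                Server ! {self(), Ref, Cmd},
                Ref
            end || Cmd <- Cmds],
    %% Receiving on each ref in turn keeps replies in command order.
    [receive {Ref, Reply} -> Reply end || Ref <- Refs].
```

Both functions return the same list of replies. The pipelined variant needs only one caller process, whereas getting the same overlap out of the synchronous style means spawning one process per command, which is the approach this PR takes.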
Force-pushed from f14957d to a14595d.
@kjnilsson do you have any more feedback on the updated version?
Force-pushed from a14595d to ea57c83.
    amqqueue:get_type(Q) == ?MODULE,
    lists:member(Node, get_nodes(Q))]),
    Parent = self(),
    lists:flatten([begin
could you use ra_lib:partition_parallel/2|3 here?
Not directly: shrink/2 returns the current or updated size of the cluster and that's used in the output of the rabbitmq-queues shrink command. With ra_lib:partition_parallel/2 we need to return a boolean so we can't add the size info. Implementation-wise though, this looks nearly the same.
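A small sketch of the point about return values, using only the standard library. `lists:partition/2` stands in for any boolean-predicate helper here. The module `partition_vs_map_sketch`, the function `fake_shrink/1` and its result shapes are hypothetical, chosen only to resemble "success or failure plus a size".

```erlang
-module(partition_vs_map_sketch).
-export([fake_shrink/1, as_partition/1, as_map/1]).

%% Hypothetical stand-in for a shrink call: reports the member count
%% after the attempt. Not the real shrink/2.
fake_shrink({_Name, Members}) when Members > 1 -> {ok, Members - 1};
fake_shrink({_Name, Members}) -> {error, Members, last_node}.

%% Predicate style: only "succeeded or not" survives; the size is lost.
as_partition(Queues) ->
    lists:partition(fun(Q) -> element(1, fake_shrink(Q)) =:= ok end, Queues).

%% Map style: every queue name is paired with the full result.
as_map(Queues) ->
    [{Name, fake_shrink(Q)} || {Name, _} = Q <- Queues].
```

With the predicate style the caller gets two lists of inputs back and would have to look the sizes up again, while the map style carries the size through to whatever prints the command output.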
Force-pushed from ea57c83 to b74999d.
Shrinking a member node off of a QQ can be parallelized. The operation involves

* removing the node from the QQ's cluster membership (appending a command to the log and committing it) with `ra:remove_member/3`
* updating the metadata store to remove the member from the QQ type state with `rabbit_amqqueue:update/2`
* deleting the queue data from the node with `ra:force_delete_server/2` if the node can be reached

All of these operations are I/O bound. Updating the cluster membership and metadata store involves appending commands to those logs and replicating them. Writing commands to Ra synchronously in serial is fairly slow - sending many commands in parallel is much more efficient. By parallelizing these steps we can write larger chunks of commands to WAL(s).

`ra:force_delete_server/2` benefits from parallelizing if the node being shrunk off is no longer reachable, for example in some hardware failures. The underlying `rpc:call/4` will attempt to auto-connect to the node and this can take some time to time out. By parallelizing this, each `rpc:call/4` reuses the same underlying distribution entry and all calls fail together once the connection fails to establish.
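The batching idea described above can be sketched in plain Erlang as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the module `batch_parallel_sketch`, `pmap_batched/3` and the batch size argument are hypothetical, and `Fun` stands in for whatever per-queue work is being parallelized.

```erlang
-module(batch_parallel_sketch).
-export([pmap_batched/3]).

%% Apply Fun to every element of List with at most BatchSize calls
%% running concurrently. Results come back in input order.
pmap_batched(Fun, List, BatchSize) when is_integer(BatchSize), BatchSize > 0 ->
    lists:append([run_batch(Fun, Batch) || Batch <- chunk(List, BatchSize)]).

%% Run one batch: a short-lived linked process per item, each replying
%% to the parent tagged with a unique reference.
run_batch(Fun, Batch) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, Fun(Item)} end),
                Ref
            end || Item <- Batch],
    %% Waiting on each ref in turn preserves order; the whole batch
    %% finishes before the next one starts.
    [receive {Ref, Result} -> Result end || Ref <- Refs].

%% Split a list into sublists of at most N elements.
chunk([], _N) -> [];
chunk(List, N) when length(List) =< N -> [List];
chunk(List, N) ->
    {Head, Tail} = lists:split(N, List),
    [Head | chunk(Tail, N)].
```

Because the workers are linked, a crash in `Fun` also takes down the caller unless it traps exits. A real implementation would have to decide whether a failure on one queue should abort the whole batch or be reported per queue.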
Force-pushed from b74999d to 511692a.
Discussed in #15057