rpc: sink_impl: batch sending and deletion of snd_buf:s #2980
base: master
Conversation
Pull Request Overview
This PR fixes a backpressure mechanism issue in the RPC sink implementation where semaphore units were being released prematurely, allowing other shards to accumulate too many resources. The solution extends the semaphore units' lifetime to match the foreign_ptr by storing them in the snd_buf structure.
- Adds semaphore_units field to snd_buf structure to extend its lifetime
- Moves semaphore units assignment before the remote execution submission
- Removes premature semaphore units release from the completion handler
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description
--- | ---
include/seastar/rpc/rpc_types.hh | Adds semaphore.hh include and semaphore_units field to snd_buf struct
include/seastar/rpc/rpc_impl.hh | Moves semaphore units to snd_buf and removes premature release
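For orientation, here is a minimal sketch of the idea described above, not the actual patch: names like snd_buf_sketch and send_sketch are illustrative. The point is that the semaphore units are stowed inside the buffer before the cross-shard submission, so they are released only when the foreign_ptr owning the buffer is finally destroyed on the origin shard.

```cpp
// Sketch only: illustrative types, not Seastar's actual rpc internals.
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/sharded.hh>   // foreign_ptr, make_foreign
#include <seastar/core/smp.hh>
#include <memory>

struct snd_buf_sketch {
    size_t size;
    seastar::semaphore_units<> su;   // held until the buffer itself is destroyed
    // payload fragments elided
};

seastar::future<> send_sketch(seastar::semaphore& sem, unsigned conn_shard, snd_buf_sketch data) {
    auto size = data.size;
    return seastar::get_units(sem, size).then(
            [conn_shard, data = std::make_unique<snd_buf_sketch>(std::move(data))]
            (seastar::semaphore_units<> su) mutable {
        // Key point of the fix: attach the units to the buffer *before*
        // submitting, instead of releasing them when submit_to resolves.
        data->su = std::move(su);
        auto fdata = seastar::make_foreign(std::move(data));
        return seastar::smp::submit_to(conn_shard, [fdata = std::move(fdata)] () mutable {
            // ... hand the buffer to the connection for sending ...
            // When the foreign_ptr is eventually destroyed back on the origin
            // shard, ~snd_buf_sketch releases the units, so the producer is
            // unblocked only once the memory is really gone.
            return seastar::make_ready_future<>();
        });
    });
}
```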
Force-pushed 0d42b97 to c8f1c1c: added comment documenting new snd_buf semaphore_units member
Looks fine to me, but it would have been nice to reproduce the problem and see that there is an improvement. What about a test?
I am working on a test that reproduces the issue without this backpressure extension.
upd: I was able to reproduce it.
Note also the long queue waiting for semaphore units.
Force-pushed c8f1c1c to 48452c0. In 48452c0:
            throw std::runtime_error(msg);
        }
        break;
    }
This infinite loop is not easy to understand.
I'd expect something like
try {
    while (auto msg = source().get()) {
        ...
    }
} catch ...
I have a feeling this is the wrong approach. For every message we send, we create a cross-shard set of tasks. We have to account for out-of-order messages due to task reordering (which @gleb-cloudius hates). But messages can be tiny (tombstone-only mutation fragments in ScyllaDB). The producer (caller of sink_impl::operator()) will have no problem producing messages back-to-back. Suggest this:
The goal here is to have the local producer produce into the local buffer, and to send the local buffer batch over with a concurrency of 1. There is no reordering.
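Roughly what that suggestion could look like, as a sketch under assumed names (batched_sender_sketch and its members are illustrative, and the real PR differs): the producer only appends to a shard-local batch, and a single loop ships whole batches to the connection's shard, one in flight at a time, so ordering is preserved by construction.

```cpp
// Sketch only: assumed names, simplified error handling.
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>   // do_until
#include <seastar/core/smp.hh>
#include <utility>
#include <vector>

struct snd_buf;   // the rpc buffer type, used opaquely here

struct batched_sender_sketch {
    std::vector<snd_buf*> _batch;       // filled on the local (sink) shard
    bool _send_loop_running = false;
    unsigned _conn_shard = 0;

    seastar::future<> enqueue(snd_buf* b) {
        _batch.push_back(b);
        if (_send_loop_running) {
            return seastar::make_ready_future<>();   // the running loop will pick it up
        }
        _send_loop_running = true;
        return seastar::do_until([this] { return _batch.empty(); }, [this] {
            auto batch = std::exchange(_batch, {});
            // One cross-shard call per accumulated batch, with concurrency 1,
            // so messages stay in order. (Shipping the vector itself across
            // shards raises the foreign_ptr question discussed further below.)
            return seastar::smp::submit_to(_conn_shard, [batch = std::move(batch)] {
                // send every buffer of the batch, in order, on the connection shard
                return seastar::make_ready_future<>();
            });
        }).finally([this] { _send_loop_running = false; });
    }
};
```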
It may work, but it will not fix the problem @bhalevy tries to fix here. The problem is that the work submitted to another shard is not fully completed when the cross-smp call returns, since snd_buf's are freed asynchronously.
The high level problem here is the long queue of cross-shard tasks. What Avi suggests is managing the send queue using a single cross-shard task that would preserve the message order.
Maybe I am wrong, but the reports we saw were all about foreign_ptr destruction, not about tasks submitted from sink::op(). The latter is limited by the semaphore while the former is not. But I see how a lot of very small messages can create a lot of submit_to calls.
The work would be reduced by a large amount since cross-smp calls would happen for entire batches (in both directions).
I do not see what will do the batching.
We could wrap the vector with a foreign_ptr rather than individual snd_bufs, though it may be hard to keep them in the vector. Alternatively, detach them from the vector, then collect them again after use. Note that the smp call that sends the vector blocks until it's processed, in order to let the tcp listener accumulate a new batch.
Why do we suddenly care so much about cross-shard streaming, which should not be happening in normal circumstances? Is this because of tablet brokenness that does not preserve shard locality? It will kill our performance in many other places as well.
It's not only tablets, it's also the mixed-node case that sometimes happens. Eventually we need to go to a full mesh (shard-to-shard), but we can't do that with TCP. I'd like to see RPC over QUIC.
auto ret_fut = con->send(std::move(local_data), {}, nullptr);
This becomes a loop over the vector, no? We can make ret_fut return local_data.
Force-pushed 2f9fbc9 to 030f127: fixed header dependency issue
Alpine Linux / build-and-test fails with rpc_test timeout. But I'm not sure how to reproduce it and get to the bottom of it.
what are these?
eh?
Force-pushed 030f127 to 874520d.
@avikivity @xemul I don't know how to reproduce this failure, which seems persistent.
There's some persistent ccache state. I'll try to delete it.
Apparently the build recovered from it. It failed on the rpc unit test.
Force-pushed 874520d to 1d9dc73.
@avikivity @xemul now the Alpine Linux build_and_test job failed in scheduling_group_nesting_test.
include/seastar/rpc/rpc.hh (outdated):
static void destroy_and_delete(snd_buf* obj_ptr) {
    obj_ptr->~snd_buf();
    std::default_delete<snd_buf>()(obj_ptr);
Isn't this `delete obj_ptr`?
And you just called the destructor! confused.
I did that since it was allocated using std::unique_ptr. If we'd use `new` to allocate it, then `delete` would be appropriate here.
I thought of this too after submitting this version. I think using `new` and `delete` makes more sense in this context. We just need to be careful to delete the snd_buf on exception in sink_impl::operator(), which is achieved today since we're using std::unique_ptr.
Used `delete` in 7bd9b0b.
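For completeness, the form the review converged on, as a tiny sketch (presumably close to what 7bd9b0b does, though the exact change isn't quoted here): explicitly invoking the destructor and then std::default_delete on the same pointer destroys the object twice, whereas a buffer allocated with new can simply be deleted.

```cpp
// Sketch: once the buffer is allocated with new, a plain delete both runs
// ~snd_buf() and frees the storage, so there is no separate destructor call.
static void destroy_and_delete(snd_buf* obj_ptr) {
    delete obj_ptr;
}
```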
include/seastar/rpc/rpc.hh (outdated):
} _remote_state;
sink_impl_remote_state _remote_state;

bi::slist<snd_buf, bi::constant_time_size<false>, bi::cache_last<true>> _queue;
It may be better to use a vector.
The problem is transferring the vector across shards, so it would probably need to be a foreign_ptr<std::unique_ptr<std::vector>>, which isn't too bad. Otherwise, maybe we could reuse the vector allocation for queuing the exhausted snd_buf:s for destruction, but that is really ugly since the production rate isn't the same as the consumption rate: when we process a batch of snd_buf:s they are consumed by the socket, and eventually the respective deleter is destroyed and queues the snd_buf back for destruction. I don't want to wait for the whole batch to complete before refilling the same vector, so the foreign vector used by the sink shard to queue the outgoing messages can be destroyed when fully consumed, and a foreign vector can be allocated by the connection shard to queue up exhausted snd_buf:s for destruction, and then be destroyed on the sink shard when the destroy batch completes.
So all in all, I think that an intrusive slist could be simpler and more efficient than a vector.
ok, no need for foreign_ptr. See 7bd9b0b
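A minimal sketch of the intrusive-slist idea discussed above (the type names queued_snd_buf and snd_buf_list are assumptions, not the PR's): elements carry their own link, so a whole batch can be detached in O(1) and handed to another shard without allocating a container, and the same hook can later re-link exhausted buffers onto a delete queue.

```cpp
// Sketch only: assumed names; the option set matches the PR's _queue declaration.
#include <boost/intrusive/slist.hpp>

namespace bi = boost::intrusive;

struct queued_snd_buf : public bi::slist_base_hook<> {
    // payload elided
};

// No size counter, cached tail for O(1) push_back.
using snd_buf_list = bi::slist<queued_snd_buf,
                               bi::constant_time_size<false>,
                               bi::cache_last<true>>;

// Detach everything currently queued; nothing is copied or allocated, so the
// resulting list can be carried across shards without wrapping it in a
// foreign_ptr (the elements themselves just have to outlive the processing).
snd_buf_list grab_batch(snd_buf_list& queue) {
    snd_buf_list batch;
    batch.swap(queue);
    return batch;
}
```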
include/seastar/rpc/rpc_impl.hh (outdated):
return do_until([this] { return _queue.empty(); }, [this] {
    // Grab the current queue and send it on the remote shard
    // It can be moved safely across shards as it uses an intrusive list
    auto batch = std::move(_queue);
std::exchange() is better here since the clearing of the current queue is part of the job, not a side effect.
Can do, though std::moving a vector practically clears it.
std::exchange makes it more explicit.
Used std::swap in 7bd9b0b
But for a slightly different purpose :)
There it may help reduce allocations for tx bursts
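For reference, the three spellings being weighed above, as a self-contained sketch (using std::vector<int> as a stand-in for the real queue type): std::exchange makes the grab-and-clear explicit, while relying on a move leaves the source in a moved-from state that is empty in practice.

```cpp
#include <utility>
#include <vector>

void drain_sketch(std::vector<int>& queue) {
    auto a = std::move(queue);            // source left valid-but-unspecified (empty in practice)
    auto b = std::exchange(queue, {});    // explicit grab-and-clear; queue is definitely empty
    std::vector<int> c;
    c.swap(queue);                        // the swap form, used (for a related purpose) in 7bd9b0b
    (void)a; (void)b; (void)c;
}
```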
While complicated, it looks good. Worth testing with scylladb mutation streaming and tiny mutations, with cross-shard streams, and observing the reduction in cross-shard calls.
Gleb wrote:
> backpresure mechanism in sink_impl<Serializer, Out...>::operator() does not work as expected. It uses semaphore _sem to limit the amount of work on the other shard but units are released before foreign_ptr is freed, so another shard may accumulate a lot of them. The solution as I see it is to make semaphore_units lifetime to be the same as foreign_ptr (by holding it in the snd_buf for instance).

Fixes scylladb#2979
Refs scylladb/scylladb#24818
Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
Define snd_buf_batched_queue, which is used for batch processing of snd_buf:s on a remote shard. It is used first to queue up buffers on the send path, where a send loop is invoked on the connection shard to send queued batches of snd_buf:s. Then, on the completion path, the exhausted buffers are queued for deletion on the delete queue, where the processing loop is invoked back on the sink shard to delete the buffers. Both changes avoid too-long task queues that may be caused by sending small messages across shards. Note that snd_buf_batched_queue ensures processing of the buffers in FIFO order also across shards, so the sequence_number mechanism previously used to reorder out-of-order continuations was dropped. Signed-off-by: Benny Halevy <[email protected]>
Make sure any errors are returned as an exceptional future rather than thrown as exceptions. With that, close can easily be used to auto-close the sink using deferred_close. Signed-off-by: Benny Halevy <[email protected]>
Aborts using on_fatal_internal_error when the task queue grows too long (over the configured max_task_backlog which is 1000 by default). This is useful mostly for tests that may trigger too long queues and want to fail when that happens. Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
Force-pushed 1d9dc73 to 7bd9b0b.
@gleb-cloudius please review again
auto size = std::min(size_t(data.size), max_stream_buffers_memory);
const auto seq_num = _next_seq_num++;
return get_units(this->_sem, size).then([this, data = make_foreign(std::make_unique<snd_buf>(std::move(data))), seq_num] (semaphore_units<> su) mutable {
return get_units(this->_sem, size).then([this, data = std::make_unique<snd_buf>(std::move(data))] (semaphore_units<> su) mutable {
Why do you need to move `data` into a `unique_ptr` now?
return seastar::do_until([this] { return _cur_batch_pos == _cur_batch.end(); }, [this] {
    auto* buf = *_cur_batch_pos;
    ++_cur_batch_pos;
    return _process_func(buf);
Why not a coroutine? Then you can just loop with `for (auto i : _cur_batch)`.
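What that might look like, assuming the enclosing member function can become a coroutine and keeping the names from the quoted snippet (`_cur_batch`, `_cur_batch_pos`, `_process_func`) for illustration:

```cpp
// Sketch: the body of the processing function, rewritten as a coroutine.
// (Assumes the enclosing member function returns seastar::future<> and the
// translation unit includes <seastar/core/coroutine.hh>.)
for (auto* buf : _cur_batch) {       // replaces the explicit _cur_batch_pos cursor
    co_await _process_func(buf);
}
```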
// a deleter of a new buffer takes care of deleting the original buffer
template<typename T> // T is either snd_buf or rcv_buf
T make_shard_local_buffer_copy(foreign_ptr<std::unique_ptr<T>> org) {
rcv_buf make_shard_local_buffer_copy(foreign_ptr<std::unique_ptr<rcv_buf>> org) {
Don't we have the same problem during receive?
{}

virtual ~snd_buf_deleter_impl() override {
    _delete_queue.enqueue(_obj_ptr);
This may throw, no?
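One way the linking step itself can be made non-throwing, as a sketch with assumed names rather than the patch itself (whatever later kicks off the cross-shard delete loop would still need its own error handling): if the delete queue is an intrusive list, enqueueing is just pointer relinking, so the deleter's destructor can stay noexcept.

```cpp
// Sketch only: assumed names; mirrors the intrusive-slist sketch earlier.
#include <boost/intrusive/slist.hpp>

namespace bi = boost::intrusive;

struct queued_snd_buf : public bi::slist_base_hook<> { /* payload elided */ };
using snd_buf_list = bi::slist<queued_snd_buf,
                               bi::constant_time_size<false>,
                               bi::cache_last<true>>;

struct snd_buf_deleter_sketch {
    snd_buf_list& _delete_queue;
    queued_snd_buf* _obj_ptr;

    ~snd_buf_deleter_sketch() noexcept {
        // Re-link the exhausted buffer onto the delete queue: no allocation,
        // hence no exception path here.
        _delete_queue.push_back(*_obj_ptr);
    }
};
```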
This series first extends rpc sink_impl backpressure until snd_buf destruction,
so that callers block until enough of the memory held by outstanding snd_buf:s is freed.
In addition, batching mechanisms are added to queue up snd_buf:s
while a send loop is busy sending the previous batch, possibly on a remote shard.
When done sending, the original buffers are queued again for batch
destroy and delete on their original shard.
The batching mechanisms avoid the too-long task queues that were
caused by small messages being sent and destroyed individually
across shards.
Plus, the single send loop ensures in-order sending of the messages,
which simplifies the sink implementation: it no longer needs
to sequence messages and reorder them in the submit_to task.
Fixes #2979
Refs scylladb/scylladb#24818