
Conversation

@bhalevy (Member) commented Sep 15, 2025

This series first extends rpc sink_impl backpressure until snd_buf destruction,
so that callers block until enough memory held by outstanding snd_buf:s is freed.

In addition, batching mechanisms are added to queue up snd_buf:s
while a send_loop is busy sending the previous batch, possibly on a remote shard.

When done sending, the original buffers are queued again for batched
destruction and deletion on their original shard.

The batching mechanisms avoid too-long task queues that were
caused by small messages being sent and destroyed individually
across shards.

Moreover, the single send loop ensures in-order sending of messages,
which simplifies the sink implementation: it no longer needs
to sequence messages and reorder them in the submit_to task.
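
A minimal sketch of the backpressure extension, assuming an illustrative member name (the actual field in rpc_types.hh may differ): the units acquired in sink_impl::operator() move into the buffer itself, so they are released only when the snd_buf is destroyed on its original shard, not when it is merely sent.

    // Sketch only; 'su' is an assumed name.
    struct snd_buf {
        // ... existing members ...

        // Units taken from sink_impl::_sem when the message was queued.
        // Holding them here extends backpressure until destruction.
        semaphore_units<> su;
    };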

Fixes #2979
Refs scylladb/scylladb#24818

@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR fixes a backpressure mechanism issue in the RPC sink implementation where semaphore units were being released prematurely, allowing other shards to accumulate too many resources. The solution extends the semaphore units' lifetime to match the foreign_ptr by storing them in the snd_buf structure.

  • Adds a semaphore_units field to the snd_buf structure to extend the units' lifetime
  • Moves the semaphore units assignment to before the remote execution submission
  • Removes the premature semaphore units release from the completion handler

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Changed files:

  • include/seastar/rpc/rpc_types.hh: adds the semaphore.hh include and a semaphore_units field to the snd_buf struct
  • include/seastar/rpc/rpc_impl.hh: moves the semaphore units into snd_buf and removes the premature release


@bhalevy force-pushed the rpc-sink_impl-extend-backpressure-until-snd_buf-destroy branch from 0d42b97 to c8f1c1c on September 16, 2025
@bhalevy (Member, Author) commented Sep 16, 2025

c8f1c1c: added a comment documenting the new snd_buf semaphore_units member

@gleb-cloudius (Contributor) left a comment

Looks fine to me, but it would have been nice to reproduce the problem and see that there is an improvement.

@avikivity (Member) commented:

What about a test?

@bhalevy (Member, Author) commented Sep 21, 2025

> What about a test?

I am working on a test that reproduces the issue without this backpressure extension,
but I haven't been able to trigger any issue yet in release mode.

@bhalevy (Member, Author) commented Sep 25, 2025

Update: I was able to reproduce "Too long queue" reports, such as the one below:

WARN  2025-09-22 14:45:10,551 [shard  6:main] seastar - Too long queue accumulated for main (1180 tasks)
 1: N7seastar8internal17do_for_each_stateIN9__gnu_cxx17__normal_iteratorIPNS_16temporary_bufferIcEESt6vectorIS5_SaIS5_EEEESA_ZZNS_23loopback_data_sink_impl3putENS_3net6packetEENKUlRS9_E_clESE_EUlRS5_E_EE
 1: N7seastar12continuationINS_8internal22promise_base_with_typeIvEEZNS_3rpc10connection4sendENS4_7snd_bufESt8optionalINSt6chrono10time_pointINS_12lowres_clockENS8_8durationIlSt5ratioILl1ELl1000000000EEEEEEEPNS4_11cancellableEE3$_1ZNS_6futureIvE14then_impl_nrvoISJ_SL_EET0_OT_EUlOS3_RSJ_ONS_12future_stateINS1_9monostateEEEE_vEE
 484: N7seastar12continuationINS_8internal22promise_base_with_typeIvEEZZNS_3rpc9sink_implI10serializerJNS_13basic_sstringIcjLj15ELb1EEEEEclERKS8_ENUlNS_15semaphore_unitsINS_35semaphore_default_exception_factoryENSt6chrono3_V212steady_clockEEEE_clESH_EUlNS_6futureIvEEE_ZNSK_17then_wrapped_nrvoIvSL_EENS_8futurizeIT_E4typeEOT0_EUlOS3_RSL_ONS_12future_stateINS1_9monostateEEEE_vEE
 484: N7seastar17smp_message_queue15async_work_itemIZNS_11foreign_ptrISt10unique_ptrINS_3rpc7snd_bufESt14default_deleteIS5_EEE10destroy_onES8_jEUlvE_EE
 1: N7seastar17smp_message_queue15async_work_itemIZZZNS_23loopback_data_sink_impl3putENS_3net6packetEENKUlRSt6vectorINS_16temporary_bufferIcEESaIS7_EEE_clESA_ENKUlRS7_E_clESC_EUlvE_EE
 114: N7seastar12continuationINS_8internal22promise_base_with_typeIvEEZNS_6futureIvE16handle_exceptionIZZNS_7reactor17run_in_backgroundES5_EN3$_0clEvEUlNSt15__exception_ptr13exception_ptrEE_Qoooooosr3stdE16is_invocable_r_vINS4_IT_EETL0__SA_Eaaeqsr3stdE12tuple_size_vINSt11conditionalIXsr3stdE9is_same_vINS1_18future_stored_typeIJSC_EE4typeENS1_9monostateEEESt5tupleIJEESL_IJSJ_EEE4typeEELi0Esr3stdE16is_invocable_r_vIvSF_SA_Eaaeqsr3stdE12tuple_size_vISP_ELi1Esr3stdE16is_invocable_r_vISC_SF_SA_Eaagtsr3stdE12tuple_size_vISP_ELi1Esr3stdE16is_invocable_r_vISP_SF_SA_EEES5_OSC_EUlSQ_E_ZNS5_17then_wrapped_nrvoIS5_SR_EENS_8futurizeISC_E4typeEOT0_EUlOS3_RSR_ONS_12future_stateISK_EEE_vEE
 95: N7seastar17smp_message_queue15async_work_itemIZZNS_3rpc9sink_implI10serializerJNS_13basic_sstringIcjLj15ELb1EEEEEclERKS6_ENUlNS_15semaphore_unitsINS_35semaphore_default_exception_factoryENSt6chrono3_V212steady_clockEEEE_clESF_EUlvE_EE

Note also the long queue waiting for semaphore units.
That said, the issue was also reproduced with the change (although apparently less frequently), and I want to understand exactly why.

@bhalevy force-pushed the rpc-sink_impl-extend-backpressure-until-snd_buf-destroy branch from c8f1c1c to 48452c0 on September 27, 2025
@bhalevy (Member, Author) commented Sep 27, 2025

In 48452c0:

  • rpc: sink_impl: extend backpressure until snd_buf destroy
    • Reduce concurrency when sending across shards to avoid too-long queues
  • reactor: add abort_on_too_long_task_queue option
    • for testing
  • rpc: make sink::close noexcept
    • for deferred_close
  • test: rpc_test: add test_rpc_stream_backpressure_across_shards
    • tidied up to focus on a cross-shard rpc sink by streaming back to the originator from the rpc source handler (since source.make_sink() generates a cross-shard connection)

throw std::runtime_error(msg);
}
break;
}
@avikivity (Member) commented Sep 27, 2025

This infinite loop is not easy to understand.

I'd expect something like

    try {
        while (auto msg = source().get()) {
            ...
        }
    } catch ...

@avikivity (Member) commented:

I have a feeling this is the wrong approach. For every message we send, we create a cross-shard set of tasks. We have to account for out-of-order messages due to task reordering (which @gleb-cloudius hates). But messages can be tiny (tombstone-only mutation fragments in ScyllaDB).

The producer (caller of sink_impl::operator()) will have no problem producing messages back-to-back.

Suggest this:

  1. Add a buffer: vector of messages
  2. Allow at most one cross-smp call at a time
  3. When we are called with a new message, and the buffer is empty, and there are no active cross-smp calls, send it to the remote
  4. Otherwise, append the new message to the buffer. Return a ready future if the buffer is small enough, or a non-ready future if it's too large.
  5. When a cross-smp call returns: send the pending buffer over, and signal the non-ready future we returned in step 4 so the producer can produce again.

The goal here is to have the local producer producing into the local buffer, and send the local buffer batch over with a concurrency of 1. There is no reordering.
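
For illustration, a minimal sketch of steps 1-5; all names here (_buf, _busy, _cv, _conn_shard, send_over(), max_batch) are assumptions, not the actual sink_impl members:

    future<> sink_impl::operator()(message msg) {
        _buf.push_back(std::move(msg));
        if (!_busy) {
            _busy = true;
            // At most one cross-smp call in flight; each iteration drains
            // whatever accumulated while the previous batch was being sent.
            (void)do_until([this] { return _buf.empty(); }, [this] {
                auto batch = std::exchange(_buf, {});
                return smp::submit_to(_conn_shard, [b = std::move(batch)] () mutable {
                    return send_over(std::move(b)); // sends the batch in order
                }).then([this] {
                    _cv.broadcast(); // step 5: wake throttled producers
                });
            }).finally([this] { _busy = false; });
        }
        // Step 4: ready future while the buffer is small enough, otherwise
        // block the producer until the in-flight batch completes.
        return _buf.size() <= max_batch ? make_ready_future<>() : _cv.wait();
    }

Step 3 is subsumed by the first loop iteration, which sends a single-message batch when the sink is idle.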

@gleb-cloudius (Contributor) commented:

> Suggest this: (batching steps 1-5 above)

It may work, but it will not fix the problem @bhalevy tries to fix here. The problem is that the work submitted to another shard is not fully completed when the cross-smp call returns, since snd_buf:s are freed asynchronously.

@bhalevy (Member, Author) commented Sep 28, 2025

> Suggest this: (batching steps 1-5 above)
>
> It may work, but it will not fix the problem @bhalevy tries to fix here.

The high-level problem here is the long queue of cross-shard tasks.
So even if throttling is extended until the snd_buf:s are destroyed, we can still generate a long queue of tasks, since each snd_buf is sent via submit_to to the connection shard (and then we need to deal with the aftermath of losing ordering, using the hand-crafted sequencing and out_of_order_bufs).

What Avi suggests is managing the send queue using a single cross-shard task that would preserve message order.
It would still need to extend throttling until the snd_buf:s are freed too, though. I agree with that.

@gleb-cloudius (Contributor) commented:

> Suggest this: (batching steps 1-5 above)
>
> It may work, but it will not fix the problem @bhalevy tries to fix here.
>
> What Avi suggests is managing the send queue using a single cross-shard task that would preserve message order. It would still need to extend throttling until the snd_buf:s are freed too, though.

Maybe I am wrong, but the reports we saw were all about foreign_ptr destruction, not about tasks submitted from sink::operator(). The latter is limited by the semaphore while the former is not. But I see how a lot of very small messages can create a lot of submit_to calls.

@avikivity (Member) commented:

> Suggest this: (batching steps 1-5 above)
>
> It may work, but it will not fix the problem @bhalevy tries to fix here. The problem is that the work submitted to another shard is not fully completed when the cross-smp call returns, since snd_buf:s are freed asynchronously.

The work would be reduced by a large amount since cross-smp calls would happen for entire batches (in both directions).

@gleb-cloudius (Contributor) commented:

> The work would be reduced by a large amount since cross-smp calls would happen for entire batches (in both directions).

I do not see what will batch the foreign_ptr destructors.

@avikivity (Member) commented:

We could wrap the vector with a foreign_ptr rather than the individual snd_bufs, though it may be hard to keep them in the vector.

Alternatively, detach them from the vector, then collect them again after use.

Note that the smp call that sends the vector blocks until it's processed, in order to let the tcp listener accumulate a new batch.

@gleb-cloudius (Contributor) commented:

snd_buf is moved around, so it is hard to collect. We can create a fancy deleter that tries to collect them, but the point is that the sender should wait for them to be deleted, not just sent, otherwise they may accumulate.

Why do we suddenly care so much about cross-shard streaming, which should not be happening in normal circumstances? Is this because of tablet brokenness that does not preserve shard locality? It will kill our performance in many other places as well.

@avikivity (Member) commented:

It's not only tablets, it's also mixed nodes, which sometimes happen.

Eventually we need to go to a full mesh (shard-to-shard), but we can't do that with TCP. I'd like to see RPC over QUIC.

@avikivity (Member) commented:

            auto ret_fut = con->send(std::move(local_data), {}, nullptr);

This becomes a loop over the vector, no? We can make ret_fut return local_data.

@bhalevy force-pushed the rpc-sink_impl-extend-backpressure-until-snd_buf-destroy branch from 2f9fbc9 to 030f127 on September 30, 2025
@bhalevy (Member, Author) commented Sep 30, 2025

030f127: fixed a header dependency issue

@bhalevy (Member, Author) commented Sep 30, 2025

Alpine Linux / build-and-test fails with rpc_test timeout.
See https://github.com/scylladb/seastar/actions/runs/18129947086/job/51593922425?pr=2980

But I'm not sure how to reproduce it and get to the bottom of it.
It doesn't reproduce for me locally.

@bhalevy (Member, Author) commented Sep 30, 2025

ninja: corrupt build log: missing field (repeated 8 times)

what are these?

WARN  2025-09-30 12:47:32,783 [shard 0:main] seastar - Exceptional future ignored: St9bad_alloc (std::bad_alloc), backtrace: 
   --------
   seastar::smp_message_queue::async_work_item<seastar::rpc::server::connection::deregister_this_stream()::{lambda()#1}>
ERROR 2025-09-30 12:47:34,662 [shard 0:main] seastar - Timer callback failed: St9bad_alloc (std::bad_alloc)

eh?

@bhalevy force-pushed the rpc-sink_impl-extend-backpressure-until-snd_buf-destroy branch from 030f127 to 874520d on October 2, 2025
@bhalevy (Member, Author) commented Oct 2, 2025

874520d:

  • rebased
  • used a boost::intrusive singly-linked list instead of a doubly-linked list to reduce the per-buffer footprint to a single pointer
    • this is possible since the lists are pure FIFOs: entries are always queued at the back and consumed from the front (see the sketch below)
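
A sketch of the shape this takes; deriving snd_buf from a base hook is assumed here for illustration:

    namespace bi = boost::intrusive;

    // A singly-linked hook costs one pointer per buffer.
    struct snd_buf : public bi::slist_base_hook<> {
        // ...
    };

    // cache_last<true> keeps a tail pointer, making push_back() O(1);
    // the list is a pure FIFO: enqueue at the back, consume from the front.
    bi::slist<snd_buf, bi::constant_time_size<false>, bi::cache_last<true>> queue;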

@bhalevy (Member, Author) commented Oct 2, 2025

> ninja: corrupt build log: missing field (repeated 8 times)
>
> what are these?
>
> WARN  2025-09-30 12:47:32,783 [shard 0:main] seastar - Exceptional future ignored: St9bad_alloc (std::bad_alloc), backtrace: ...
> ERROR 2025-09-30 12:47:34,662 [shard 0:main] seastar - Timer callback failed: St9bad_alloc (std::bad_alloc)
>
> eh?

@avikivity @xemul I don't know how to reproduce this failure, which seems persistent.
Who knows more about this Alpine Linux / build_and_test test?

@avikivity (Member) commented:

There's some persistent ccache state. I'll try to delete it.

@avikivity (Member) commented:

> There's some persistent ccache state. I'll try to delete it.

Apparently the build recovered from it. It failed on the rpc unit test.

@bhalevy force-pushed the rpc-sink_impl-extend-backpressure-until-snd_buf-destroy branch from 874520d to 1d9dc73 on October 5, 2025
@bhalevy (Member, Author) commented Oct 5, 2025

1d9dc73:

  • no need to send an empty string to indicate end-of-stream:
    • rely on the disengaged optional from source().get()
    • verify a singular EOS

@bhalevy (Member, Author) commented Oct 5, 2025

@avikivity @xemul now the Alpine Linux build_and_test job failed in scheduling_group_nesting_test.
I don't know what's special about this test environment that yields those failures.


static void destroy_and_delete(snd_buf* obj_ptr) {
obj_ptr->~snd_buf();
std::default_delete<snd_buf>()(obj_ptr);
@avikivity (Member) commented Oct 9, 2025

Isn't this delete obj_ptr?

@avikivity (Member) commented:

And you just called the destructor! confused.

@bhalevy (Member, Author) commented:

I did that since it was allocated using std::unique_ptr.
If we'd used new to allocate it, then delete would be appropriate here.
I thought of this too after submitting this version.
I think using new and delete makes more sense in this context.
We just need to be careful to delete the snd_buf on exception in sink_impl::operator(), which is achieved today since we're using std::unique_ptr.
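
For reference, std::default_delete<T>::operator() just does delete ptr, and delete itself runs the destructor, so the two-step form in the diff destroys the object twice:

    obj_ptr->~snd_buf();                      // destructor runs here...
    std::default_delete<snd_buf>()(obj_ptr);  // ...and again inside delete
    // The correct single step is simply:
    delete obj_ptr;                           // destructor + deallocation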

@bhalevy (Member, Author) commented:

Used delete in 7bd9b0b

} _remote_state;
sink_impl_remote_state _remote_state;

bi::slist<snd_buf, bi::constant_time_size<false>, bi::cache_last<true>> _queue;
@avikivity (Member) commented:

It may be better to use a vector.

@bhalevy (Member, Author) commented Oct 16, 2025

The problem is transferring the vector across shards, so it would probably need to be a foreign_ptr<std::unique_ptr<std::vector>>, which isn't too bad. Otherwise, maybe we could reuse the vector allocation for queuing the exhausted snd_buf:s for destruction, but that is really ugly, since the production rate isn't the same as the consumption rate: when we process a batch of snd_buf:s they are consumed by the socket, and eventually each respective deleter is destroyed and queues its snd_buf back for destruction. I don't want to wait for the whole batch to complete before refilling the same vector, so the foreign vector used by the sink shard to queue the outgoing messages can be destroyed when fully consumed, and a foreign vector would have to be allocated by the connection shard to queue up exhausted snd_buf:s for destruction, and then be destroyed on the sink shard when the destroy batch completes.

So, all in all, I think that an intrusive slist could be simpler and more efficient than a vector.

@bhalevy (Member, Author) commented:

OK, no need for foreign_ptr. See 7bd9b0b.

return do_until([this] { return _queue.empty(); }, [this] {
// Grab the current queue and send it on the remote shard
// It can be moved safely across shards as it uses an intrusive list
auto batch = std::move(_queue);
@avikivity (Member) commented:

std::exchange() is better here since the clearing of the current queue is part of the job, not a side effect.

@bhalevy (Member, Author) commented:

Can do, though std::move-ing a vector practically clears it;
std::exchange makes it more explicit.
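
For illustration, the two spellings side by side:

    // std::move leaves the source in a valid but unspecified state:
    auto batch = std::move(_queue);
    // std::exchange spells out that clearing the queue is intended:
    auto batch2 = std::exchange(_queue, {});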

@bhalevy (Member, Author) commented:

Used std::swap in 7bd9b0b, but for a slightly different purpose :)
There it may help reduce allocations for tx bursts.

@avikivity (Member) commented:

While complicated, this looks good.

It's worth testing with ScyllaDB mutation streaming and tiny mutations, with cross-shard streams, and observing the reduction in cross-shard calls.

Gleb wrote:
> backpressure mechanism in sink_impl<Serializer, Out...>::operator()
> does not work as expected. It uses semaphore _sem to limit the
> amount of work on the other shard but units are released before
> foreign_ptr is freed, so another shard may accumulate a lot of them.
> The solution as I see it is to make semaphore_units lifetime to be
> the same as foreign_ptr (by holding it in the snd_buf for instance).

Fixes scylladb#2979
Refs scylladb/scylladb#24818

Signed-off-by: Benny Halevy <[email protected]>
Define snd_buf_batched_queue, which is used for batch
processing of snd_buf:s on a remote shard.

It is used first to queue up buffers on the send path,
where a send loop is invoked on the connection shard
to send queued batches of snd_buf:s.

Then, on the completion path, the exhausted buffers
are queued for deletion on the delete queue, where
the processing loop is invoked back on the sink shard
to delete the buffers.

Both changes avoid too-long task queues that may be caused
by sending small messages across shards.

Note that snd_buf_batched_queue ensures processing of the
buffers in FIFO order also across shards, so the sequence_number
mechanism previously used to reorder out-of-order continuations
was dropped.

Signed-off-by: Benny Halevy <[email protected]>
Make sure any errors are returned as an exceptional
future rather than thrown as exceptions.

With that, close can be easily used to auto-close the sink
using deferred_close.

Signed-off-by: Benny Halevy <[email protected]>
Abort using on_fatal_internal_error when the task
queue grows too long (over the configured max_task_backlog,
which is 1000 by default).

This is useful mostly for tests that may trigger
too long queues and want to fail when that happens.

Signed-off-by: Benny Halevy <[email protected]>
@bhalevy force-pushed the rpc-sink_impl-extend-backpressure-until-snd_buf-destroy branch from 1d9dc73 to 7bd9b0b on October 17, 2025
@bhalevy (Member, Author) commented Oct 17, 2025

7bd9b0b:

  • define snd_buf_batched_queue, based on a small_vector
  • use it for batch processing on both the send and delete paths

@avikivity (Member) commented:

@gleb-cloudius please review again

auto size = std::min(size_t(data.size), max_stream_buffers_memory);
const auto seq_num = _next_seq_num++;
return get_units(this->_sem, size).then([this, data = make_foreign(std::make_unique<snd_buf>(std::move(data))), seq_num] (semaphore_units<> su) mutable {
return get_units(this->_sem, size).then([this, data = std::make_unique<snd_buf>(std::move(data))] (semaphore_units<> su) mutable {
@gleb-cloudius (Contributor) commented:

Why do you need to move data into a unique_ptr now?

return seastar::do_until([this] { return _cur_batch_pos == _cur_batch.end(); }, [this] {
auto* buf = *_cur_batch_pos;
++_cur_batch_pos;
return _process_func(buf);
@gleb-cloudius (Contributor) commented:

Why not a coroutine? Then you can just loop with for (auto i : _cur_batch).
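
A sketch of what the coroutine version could look like, assuming the members from the snippet above:

    future<> process_batch() {
        // _cur_batch must stay alive and unmodified across the co_awaits
        for (auto* buf : _cur_batch) {
            co_await _process_func(buf);
        }
    }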

// a deleter of a new buffer takes care of deleting the original buffer
template<typename T> // T is either snd_buf or rcv_buf
T make_shard_local_buffer_copy(foreign_ptr<std::unique_ptr<T>> org) {
rcv_buf make_shard_local_buffer_copy(foreign_ptr<std::unique_ptr<rcv_buf>> org) {
@gleb-cloudius (Contributor) commented:

Don't we have the same problem during receive?

{}

virtual ~snd_buf_deleter_impl() override {
_delete_queue.enqueue(_obj_ptr);
@gleb-cloudius (Contributor) commented:

This may throw, no?
