Handle nowait support for reads and writes independently #2987

xemul · 2025-09-17T17:25:29Z

Whether to set the RWF_NOWAIT flag on an iocb is controlled by the nowait_works bit, which travels a long way: from the reactor option via fs_info, then file, then struct request (more precisely, one of its unioned sub-structures). Today this bit is a boolean, but we've seen that nowait can be effectively broken for writes while still working for reads. This PR turns the nowait_works bit on file and fs_info into a tri-state enum. The request struct still carries it as a bit (the bits for reads and writes are independent there), and the reactor option remains boolean.

"By default" (i.e. on modern kernels with no --linux-aio-nowait option given) the mode is set to read_only for XFS. As a result, reads are submitted with the flag and writes are submitted without it. With that we trade a guaranteed write retry for a potential delay while the kernel submits the write. Hopefully, the chances of the latter are low enough.

refs: #2974

The value is going to become more than a yes/no switch, and this patch prepares for that. For now there are only two values -- yes and no. Signed-off-by: Pavel Emelyanov <[email protected]>

Continuation of the previous patch -- now the filesystem is also prepared to export its "nowait" facility as an enum class. Signed-off-by: Pavel Emelyanov <[email protected]>

This is the third state, which allows RWF_NOWAIT flag for read requests only. Signed-off-by: Pavel Emelyanov <[email protected]>
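
For illustration, a minimal sketch of how a tri-state mode could collapse into the per-request bit. The mode values (yes, no, read_only) come from the commit messages; the names nowait_mode, io_direction and use_nowait are assumptions made for this example and may not match the identifiers in the patch.

```cpp
// Hypothetical names: the PR describes a tri-state mode kept on file and
// fs_info, but the real identifiers in seastar may differ.
enum class nowait_mode { no, yes, read_only };

// Hypothetical helper type: which kind of request is being prepared.
enum class io_direction { read, write };

// Collapse the tri-state mode into the boolean that decides whether
// RWF_NOWAIT is set on the iocb for a request of the given direction.
constexpr bool use_nowait(nowait_mode mode, io_direction dir) noexcept {
    switch (mode) {
    case nowait_mode::yes:
        return true;
    case nowait_mode::read_only:
        return dir == io_direction::read;
    case nowait_mode::no:
        return false;
    }
    return false; // not reached for valid enum values
}

// In read_only mode reads keep the flag and writes lose it.
static_assert(use_nowait(nowait_mode::read_only, io_direction::read), "reads use nowait");
static_assert(!use_nowait(nowait_mode::read_only, io_direction::write), "writes do not");
```

Keeping the decision in one small function means the request itself can stay with a plain boolean, as the description says it does.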

At the end of the day, the nowait mode to use depends on three sources of information: the filesystem itself, the kernel version, and the reactor option (CLI switch). Until now the selection was between the yes and no modes; this patch wires up read_only mode detection and activation. Signed-off-by: Pavel Emelyanov <[email protected]>
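
A rough sketch of how the three sources might combine. Everything here is an assumption for illustration: the struct, its field names, and in particular the rule that an explicitly given option overrides kernel detection. The actual patch may order or weigh these inputs differently.

```cpp
#include <optional>

// Hypothetical tri-state mode, named after the values in the commit messages.
enum class nowait_mode { no, yes, read_only };

// Hypothetical bundle of the three inputs named in the commit message.
struct nowait_inputs {
    std::optional<bool> cli_option; // reactor option; empty when not given
    bool kernel_supports_nowait;    // result of a kernel version check
    nowait_mode fs_mode;            // what the filesystem reports for itself
};

// Assumed rule: an explicit option wins over kernel detection. If nowait is
// allowed at all, the filesystem's own mode (for example read_only) is used.
constexpr nowait_mode select_nowait_mode(const nowait_inputs& in) noexcept {
    const bool allowed = in.cli_option.value_or(in.kernel_supports_nowait);
    return allowed ? in.fs_mode : nowait_mode::no;
}

// No option given, capable kernel, filesystem reports read_only.
static_assert(select_nowait_mode({std::nullopt, true, nowait_mode::read_only})
                  == nowait_mode::read_only, "default case");
```

With this shape, the old boolean option keeps its all-or-nothing meaning and only the automatic path can end up in read_only.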

xemul · 2025-09-17T17:30:38Z

For the record:
Even after the kernel gets fixed (though we were asked to take a longer route), the fix will probably only work for lazytime-mounted partitions, so the decision on how to submit writes -- via the thread pool or without the nowait flag -- will partially (see scylladb/scylladb#26002) remain open.

xemul · 2025-09-29T04:48:10Z

@avikivity , please review

xemul · 2025-10-09T04:57:44Z

@avikivity , please review

avikivity · 2025-10-09T10:43:25Z

What's the goal? Why would one enable nowait for one and not the other?

Is it because we think nowait is misreported for writes but not reads?

xemul · 2025-10-09T11:45:03Z

It's not that someone enables or disables it; this "switch" is not controlled by any new option. The old option to enable/disable RWF_NOWAIT altogether doesn't change its behavior. The change here is only about the automatic selection of whether or not seastar will use this bit on iocbs. And the suggestion is: reads can be submitted with nowait, and writes without it. Or is your question why we should submit writes without nowait at all?

avikivity · 2025-10-09T14:26:56Z

Yes, why the different behavior for reads and writes?

xemul · 2025-10-09T16:40:58Z

Because nowait writes are very likely to return EAGAIN immediately, and that is largely out of our control. The major issue is the cmtime update; someone (us?) has to fix the kernel. Two options:

lazytime vs nowait handling (patch)
O_NOCMTIME flag (patch)

Both are turning out to be rather long journeys, and then we have to wait for the fixes to reach cloud instances. Meanwhile seastar needs to live with it somehow.

Other than the cmtime update, XFS files are still append-challenged: truncating a file ahead of writes doesn't help at all, and the extent allocation size hint helps only partially. But that's probably minor.

xemul · 2025-10-13T08:16:13Z

Running a "restore from backup" test with #2975 and one more metric.
Here are the AIO statistics, which show a 100% retry rate:

I also added a metric to show the number of context switches [1]. For now it looks like this

but no comments about it yet. My plan is to re-run the very same test with RWF_NOWAIT explicitly banned for writes and check whether the number of context switches goes up.

[1]

--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -2543,6 +2543,11 @@ void reactor::register_metrics() {
             io_fallback_counter("file_operation", internal::thread_pool_submit_reason::file_operation),
             // total_operations value:DERIVE:0:U
             io_fallback_counter("process_operation", internal::thread_pool_submit_reason::process_operation),
+            sm::make_counter("ru_nvcsw", sm::description("Number of voluntary context switches"), [] {
+                struct ::rusage ru;
+                ::getrusage(RUSAGE_THREAD, &ru);
+                return ru.ru_nvcsw;
+            })
     });
 
     _metric_groups.add_group("memory", {
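
Outside of seastar, the same counter can be sampled around a piece of work to see how many voluntary context switches it caused. This is a standalone sketch, not part of the patch: the helper names are made up for the example, and RUSAGE_THREAD is Linux-specific.

```cpp
#include <sys/resource.h>
#include <cassert>
#include <chrono>
#include <thread>

// Voluntary context switches of the calling thread so far.
// RUSAGE_THREAD is Linux-specific; glibc exposes it under _GNU_SOURCE,
// which g++ defines by default.
inline long voluntary_context_switches() {
    ::rusage ru{};
    [[maybe_unused]] const int rc = ::getrusage(RUSAGE_THREAD, &ru);
    assert(rc == 0);
    return ru.ru_nvcsw;
}

// Number of voluntary switches that happened while fn() ran on this thread.
template <typename Fn>
long switches_during(Fn&& fn) {
    const long before = voluntary_context_switches();
    fn();
    return voluntary_context_switches() - before;
}

// Usage example: sleeping blocks the thread, which normally shows up as
// at least one voluntary context switch.
inline long switches_during_short_sleep() {
    return switches_during([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    });
}
```

Comparing such deltas between a run with RWF_NOWAIT on writes and a run without it is the kind of measurement the plan above describes.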

avikivity · 2025-10-13T09:41:17Z

Yes, but how does it help to disable RWF_NOWAIT? Instead of a retry, you'll stall the reactor.

I can understand this logic: the write EAGAINs are bogus and the request without RWF_NOWAIT will succeed without stalling. But is this the case?

xemul · 2025-10-13T10:06:07Z

> Yes, but how does it help to disable RWF_NOWAIT? Instead of a retry, you'll stall the reactor.

I'm collecting the data to see if we really do. I agree that we trade guaranteed retry for potential stall though.

> I can understand this logic: the write EAGAINs are bogus and the request without RWF_NOWAIT will succeed without stalling. But is this the case?

Almost. A request without RWF_NOWAIT will likely succeed without stalling, that's my point, and once I have some stats, I'll post them here.

And even when the cmtime update is fixed, we still need to live somehow until the fix propagates to cloud instances. Maybe the way to go in such cases is to route all writes to the thread pool right away, without waiting for EAGAIN from the kernel?
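
To make the retry path concrete, here is a standalone sketch that uses the synchronous pwritev2() call instead of the linux-aio iocbs seastar submits, so it is only an analogy. The helper names and the set of errno values treated as "nowait refused" are assumptions; in seastar the second, blocking attempt would run in the thread pool rather than inline.

```cpp
#include <sys/uio.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdlib>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008 // value from the Linux UAPI headers
#endif

struct write_result {
    ssize_t written; // bytes written, or -1 on error (errno is set)
    bool retried;    // true if the nowait attempt was refused
};

// Try a non-blocking write first; if the kernel refuses it, repeat the
// write without the flag. Short writes are not handled in this sketch.
inline write_result write_nowait_with_fallback(int fd, const void* buf,
                                               std::size_t len, off_t pos) {
    iovec iov{const_cast<void*>(buf), len};
    const ssize_t n = ::pwritev2(fd, &iov, 1, pos, RWF_NOWAIT);
    if (n >= 0) {
        return {n, false};
    }
    switch (errno) {
    case EAGAIN:     // the write would have had to block
    case EOPNOTSUPP: // file or filesystem does not support nowait
    case EINVAL:     // some kernels reject the flag for this kind of write
    case ENOSYS:     // pwritev2 is not available
        return {::pwrite(fd, buf, len, pos), true};
    default:
        return {-1, false};
    }
}

// Usage example: write a small buffer to a fresh temporary file.
inline write_result demo_write() {
    char path[] = "/tmp/nowait-demo-XXXXXX";
    const int fd = ::mkstemp(path);
    assert(fd >= 0);
    static const char msg[] = "hello";
    const write_result r = write_nowait_with_fallback(fd, msg, sizeof(msg), 0);
    ::close(fd);
    ::unlink(path);
    return r;
}
```

Routing writes straight to the blocking path, as suggested above, would amount to skipping the first pwritev2() attempt entirely.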

xemul · 2025-10-13T10:32:43Z

With RWF_NOWAIT being explicitly cleared for write AIOs:

xemul added 4 commits September 17, 2025 20:06

file: Make nowait_works bit a enum class

b198a03


filesystem: Make nowait_works bit a enum class too

faf0e93


file: Introduce read-only nowait_mode

fe09fb4


xemul requested a review from avikivity September 17, 2025 17:25

xemul force-pushed the master branch from 5b52717 to 8549271 Compare October 10, 2025 08:26
