Skip to content

Conversation

xemul
Copy link
Contributor

@xemul xemul commented Sep 17, 2025

Whether or not to set RWF_NOWAIT flag for iocb is controlled by the nowait_works bit that travels a long journey from reactor option via fs_info, then file, then struct request (or more exactly -- one of its united sub-structures). Nowadays this bit is boolean, but we've seen that it might be ultimately broken for writes and work for reads. This PR turns the nowait_works bit on file and fs_info to be a tri-state enum. The request struct still carries it as bit (because they are independent for reads and writes) as well as reactor option remains boolean.

"By default" (i.e. -- on modern kernels with no --linux-aio-nowait option) the mode activates to read_only for XFS. As a result, reads are submitted with the flag and writes are submitted without it. With that we trade guaranteed write-retry for potential delay in the kernel submitting it. Hopefully, chances for the latter are low enough.

refs: #2974

The value is going to become more than yes/no switch and this patch
prepares for that. For now only two values -- yes and no.

Signed-off-by: Pavel Emelyanov <[email protected]>
Continuation of the previous patch -- now the filesystem is also
prepared to export is "nowait" facility as enum class.

Signed-off-by: Pavel Emelyanov <[email protected]>
This is the third state, which allows RWF_NOWAIT flag for read requests
only.

Signed-off-by: Pavel Emelyanov <[email protected]>
In the end of the day the nowait mode to use depends on three sources of
information: the filesystem itself, the kernel version and the reactor
option (CLI switch). By now the selection was between yes and no modes,
this patch wires the read_only mode detection and activation.

Signed-off-by: Pavel Emelyanov <[email protected]>
@xemul xemul requested a review from avikivity September 17, 2025 17:25
@xemul
Copy link
Contributor Author

xemul commented Sep 17, 2025

For the record:
Even after the kernel gets fixed (though we were asked to move longer route), the fix will probably only work for lazytime-mounted partitions, thus the decision how to submit writes -- via thread-pool or without the nowait flag -- will partially (see scylladb/scylladb#26002) remain open

@xemul
Copy link
Contributor Author

xemul commented Sep 29, 2025

@avikivity , please review

1 similar comment
@xemul
Copy link
Contributor Author

xemul commented Oct 9, 2025

@avikivity , please review

@avikivity
Copy link
Member

What's the goal? Why would one enable nowait for one and not the other?

Is it because we think nowait is misreported for writes but not reads?

@xemul
Copy link
Contributor Author

xemul commented Oct 9, 2025

It's not that someone enables or disables it, this "switch" is not controlled by any new option. The old option to enable/disable RWF_NOWAIT altogether doesn't change its behavior. The change here is only about automatic selection of whether or not seastar will use this bit on iocbs. And the suggestion is -- reads can be submitted with nowait and writes not. Or is your question -- why should we submit writes without nowait at all?

@avikivity
Copy link
Member

Yes, why the different behavior for reads and writes?

@xemul
Copy link
Contributor Author

xemul commented Oct 9, 2025

Because nowait writes have many chances to instantly EAGAIN and it's greatly out of our control. Major issue is cmtime update, someone (us?) have to fix the kernel. Two options

  1. lazytime vs nowait handling (patch)
  2. O_NOCMTIME flag (patch)

Both are turning to be rather long journey, then wait for the fixes to reach cloud instances. And seastar needs to live with it somehow

Other than cmtime update, XFS files are still append-challenged, truncating it ahead of writes doesn't help at all, extent-allocation-size-hint helps partially. But that's, probably, minor

@xemul
Copy link
Contributor Author

xemul commented Oct 13, 2025

Running a "restore from backup" test with #2975 and one more metrics.
Here are AIO statistics that shows 100% retry rate:

image

I also added a metrics to show the number of context switches [1]. For now it looks like this

image

but no comments about it yet. My plan is to re-run the very same test with explicit ban of RWF_NOWAIT for writes and check if context switches grow up.

[1]

--- a/src/core/reactor.cc
+++ b/src/core/reactor.cc
@@ -2543,6 +2543,11 @@ void reactor::register_metrics() {
             io_fallback_counter("file_operation", internal::thread_pool_submit_reason::file_operation),
             // total_operations value:DERIVE:0:U
             io_fallback_counter("process_operation", internal::thread_pool_submit_reason::process_operation),
+            sm::make_counter("ru_nvcsw", sm::description("Number of voluntary context switches"), [] {
+                struct ::rusage ru;
+                ::getrusage(RUSAGE_THREAD, &ru);
+                return ru.ru_nvcsw;
+            })
     });
 
     _metric_groups.add_group("memory", {

@avikivity
Copy link
Member

Yes, but how does it help to disable RWF_NOWAIT? Instead of a retry, you'll stall the reactor.

I can understand this logic: the write EAGAINs are bogus and the request without RWF_NOWAIT will succeed without stalling. But is this the case?

@xemul
Copy link
Contributor Author

xemul commented Oct 13, 2025

Yes, but how does it help to disable RWF_NOWAIT? Instead of a retry, you'll stall the reactor.

I'm collecting the data to see if we really do. I agree that we trade guaranteed retry for potential stall though.

I can understand this logic: the write EAGAINs are bogus and the request without RWF_NOWAIT will succeed without stalling. But is this the case?

Almost. Request without RWF_NOWAIT will likely succeed without stalling, that's my point, and once I have some stats, I'll post it here.

And also even when cmtime update is fixed, we still need to live somehow until the fix propagates to cloud instances. Maybe the way to go in such cases is to route all writes into thread-pool retry right at once, without waiting for EAGAIN from the kernel?

@xemul
Copy link
Contributor Author

xemul commented Oct 13, 2025

With RWF_NOWAIT being explicitly cleared for write AIOs:

image image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants