Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for mmap() writable mappings. #175

Merged
merged 14 commits into from
Feb 20, 2025
Merged

Support for mmap() writable mappings. #175

merged 14 commits into from
Feb 20, 2025

Conversation

aversecat
Copy link
Contributor

@aversecat aversecat commented Jun 11, 2024

Replaces #27, #39.

Contains mostly original patches from andy, touched up for conflicts. Additional fixups and changes to avoid various deadlocks and debug kernel warnings for lock contention issues.

  • - Does not pass xfstests:generic/346 - hard lockup in _mkwrite when doing update_inode
  • - Occasionally fails offline-extent-waiting - when reverse staging, the first blocks of the file end up zeros, not the expected content
  • - Passes all other xfstests
  • - Added cross-node mmap consistency test. doesn't work on el7
  • - sparse warnings about ret not returning vm_fault_t
  • - _walk_inodes page fault safe
  • - _get_allocated_inos page fault safe
  • - fsstress hard lockup in generic/013

@aversecat aversecat added the enhancement New feature or request label Jun 11, 2024
@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

Copy link
Collaborator

@zabbo zabbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a WIP, we knew there'd be W to do :).

@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

@versity-github
Copy link

@aversecat aversecat changed the title mmap() tree **WIP** mmap() tree. Aug 5, 2024
@versity-github
Copy link

@versity-github
Copy link

@zabbo
Copy link
Collaborator

zabbo commented Dec 3, 2024

the -debug- failures are:

Thanks for enumerating these -- yeah, so, the same repeat offenders :/.

@aversecat aversecat force-pushed the auke/mmap branch 2 times, most recently from 8b5ba17 to 282fa84 Compare December 3, 2024 23:09
@aversecat
Copy link
Contributor Author

Hmmmm, not looking good in CI.

Both debug-el8 and debug-el94 appear(*) stuck on generic/346 which is holetest with mmap. Trying to reproduce myself now.

(*) they completed generic/343 but output is stuck there. No message indicating 346 actually started.

@aversecat
Copy link
Contributor Author

aversecat commented Dec 9, 2024

-debug- failures are:

  • el8: hung task timeout in large-fragmented-free, and orphan-inodes failure
  • el94: hung task timeout in large-fragmented-free, and orphan-inodes failure
  • el95: hung task timeout in large-fragmented-free, and this one:
--- golden/archive-light-cycle	2024-12-06 20:05:22.572655167 +0000
+++ /root/scoutfs/tests/results/output/archive-light-cycle	2024-12-07 03:12:08.082647728 +0000
@@ -4,6 +4,8 @@
 == round 1: create
 == round 1: online
 == round 1: verify
+/mnt/test.3/test/archive-light-cycle/dir/3/2-2 /dev/fd/63 differ: char 277401601, line 67726
+script pid 165368 failed: rc 1
 == round 1: release
 == round 1: offline
 == round 1: stage
archive-light-cycle output differs

@zabbo
Copy link
Collaborator

zabbo commented Jan 16, 2025

retest

bcrl and others added 14 commits January 23, 2025 14:28
Adds the required memory mapped ops struct and page fault handler
for reads.

Signed-off-by: Benjamin LaHaise <[email protected]>
Signed-off-by: Auke Kok <[email protected]>
Add support for writable MAP_SHARED mmap()ings.  Avoid issues with late
writepage()s building transactions by doing the block_write_begin() work in
scoutfs_data_page_mkwrite().  Ensure the page is marked dirty and prepared
for write, then let the VM complete the write when the page is flushed or
invalidated.

Signed-off-by: Benjamin LaHaise <[email protected]>
Signed-off-by: Auke Kok <[email protected]>
Two test programs are added. The run time is about 1min on my el7
instance.

The test script finishes up with a read/write mmap test on offline
extents to verify the data wait paths in those functions.

One program will perform vfs read/write and mmap read/write calls on
the same file from across 5 threads (mounts) repeatedly.  The goal
is to assure there are no locking issues between read/write paths.

The second test program performs consistency checking on a file that is
repeatedly written/read using memory maps and normal reads and writes,
and the content is verified after every operation.

Signed-off-by: Auke Kok <[email protected]>
Now that all of these should be passing, we enable all mmap() tests in
xfstests, and update the golden output with the new tests.

Signed-off-by: Auke Kok <[email protected]>
We merely trace exit values and position, and ignore length.

Because vm_fault_t is __bitwise, sparse will loudly complain about
a plain cast to u32, so we must __force (on el8). ret will be 512 in
normal cases.

Signed-off-by: Auke Kok <[email protected]>
These 2 sections of compat for readdir are wholly obsolete and can be
hard dropped, which restores the method to look like current upstream
code.

This was added in ddd1a4e.

Signed-off-by: Auke Kok <[email protected]>
Verify using xfs_io that readdir offsets match expected output.

Signed-off-by: Auke Kok <[email protected]>
dir_emit() will copy_to_user, which can pagefault. If this happens while
cluster locked, we could deadlock.

We use a single page to stage dir_emit data, and iterate between
fetching dirents while locked, and emitting them while not locked.

Signed-off-by: Auke Kok <[email protected]>
Now that we support mmap writes, at any point in time we could
pagefault and lock for writes. That means - just like readdir -
we can no longer lock and copy_to_user, since it also may page fault
and thus deadlock.

We statically allocate 32 extent entries on the stack and use
these to shuffle out fiemap entries at a time, locking and
unlocking around collecting and fiemap_fill_extent_next.

Signed-off-by: Auke Kok <[email protected]>
Similar to readdir and fiemap vfs methods, we can't copy to user while
holding cluster locks. The previous comment about it being safe no
longer applies, and this could deadlock.

Rewrite the loop to iterate and store entries in a page, then flush
the page contents while not holding a clusterlock.

Signed-off-by: Auke Kok <[email protected]>
Similar to fiemap, readdir and walk_inodes, this method could have
put_user during a page fault, causing potentially a deadlock.

Signed-off-by: Auke Kok <[email protected]>
While debugging a double unlock error we hit this condition and
debugging would have been a lot easier had we enforced this simple
constraint that we can't decrement the lock users count if it's
already 0.

Signed-off-by: Auke Kok <[email protected]>
We need to assure we're emitting dents with the proper position
and we already have them as part of our dent. The only caveat is
to increment ctx->pos once beyond the list to make sure the caller
doesn't call us once more.

Signed-off-by: Auke Kok <[email protected]>
@aversecat
Copy link
Contributor Author

@zabbo : Mixed push for easier "re-review":

Added e59a5f8 Which adds a readdir output test using xfs_io. The golden here was generated using main.

Added e9d1472 which fixes the output of dir_emit to properly emit using proper ctx->pos values for each entry. This then now passes the test.

We could merge as is, or we can let this run through tests once more and squash the fix into @aversecat
Avoid deadlock in _readdir() due to copy_to_user().

Either obviously is fine.

@zabbo
Copy link
Collaborator

zabbo commented Feb 6, 2025

There seem to be real failures in archive-light-cycle in the 9.5 debug run. Have we looked into this? Are we seeing this anywhere else?

== round 1: create
== round 1: online
== round 1: verify
+/mnt/test.1/test/archive-light-cycle/dir/1/2-1 /dev/fd/63 differ: char 238444545, line 58215
== round 1: release
== round 1: offline
== round 1: stage```

@aversecat
Copy link
Contributor Author

@zabbo I found this in a test run in a branch(#202) that has no mmap patches:

--- golden/archive-light-cycle	2025-02-13 18:55:06.532816042 +0000
+++ /root/scoutfs/tests/results/output/archive-light-cycle	2025-02-14 02:04:50.421626043 +0000
@@ -15,6 +15,7 @@
 == round 2: create
 == round 2: online
 == round 2: verify
+/mnt/test.4/test/archive-light-cycle/dir/4/2-1 /dev/fd/63 differ: char 134172673, line 32758
 == round 2: release
 == round 2: offline
 == round 2: stage
archive-light-cycle output differs
--- golden/block-stale-reads	2025-02-13 18:55:06.541816111 +0000
+++ /root/scoutfs/tests/results/output/block-stale-reads	2025-02-14 02:04:53.510648267 +0000
@@ -1,2 +1,2 @@
 == Issue scoutfs df to force block reads to trigger stale invalidation/retry
-counter block_cache_remove_stale changed
+counter block_cache_remove_stale didn't change
block-stale-reads output differs

So it seems this is a potential issue in main.

Here's the fail.log for that run: http://jenkins.vpn.versity.com:8080/job/ng-scoutfs-test-debug-el95/16/artifact/artifacts/results/fail.log/*view*/

@zabbo
Copy link
Collaborator

zabbo commented Feb 18, 2025

So it seems this is a potential issue in main.

So it would seem, and so far only 9.5 debug runs I think.

@aversecat
Copy link
Contributor Author

aversecat commented Feb 18, 2025

I checked 2 of the offsets, and they're suspiciously at 1byte beyond a 4k boundary:

$ printf "0x%08x\n" 238444545
0x0e366001
$ printf "0x%08x\n" 203247617
0x0c1d5001

@zabbo zabbo merged commit a4b5a25 into main Feb 20, 2025
4 of 6 checks passed
@zabbo zabbo deleted the auke/mmap branch February 20, 2025 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants