-
Notifications
You must be signed in to change notification settings - Fork 200
Fix write / read metadata race on local filesystem. #5698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix write / read metadata race on local filesystem. #5698
Conversation
b885040 to
2b03211
Compare
format_spec/metadata.md
Outdated
|
|
||
| ## Metadata File | ||
|
|
||
| Metadata files with the `.tmp` extension are used for local filesystems only, indicating the metadata is still being flushed to disk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious what others think of using a .part extension instead. Either way is fine by me so by all means we can stick with .tmp, just something that came to mind while reviewing this myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.tmp has an unambiguous meaning, but you can add another extension before it if you want.
Also, this is an implementation detail that should not be mentioned in the spec. If we haven't already, we should state that readers should ignore files with names that don't follow the respective format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, I updated the format spec in 523230a (did not make changes to the extension)
tiledb/sm/array/array_directory.cc
Outdated
| uri_.join_path(constants::array_metadata_dir_name), &unfiltered_uris)); | ||
| for (const auto& unfiltered_uri : unfiltered_uris) { | ||
| const auto& uri = unfiltered_uri.to_string(); | ||
| if (uri.ends_with(constants::temp_file_suffix)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check if the file does not have a timestamped name that ends in .tdb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 523230a
tiledb/sm/filesystem/posix.cc
Outdated
| // Do not attempt to retrieve file size for temporary metadata files that | ||
| // are still flushing to disk. | ||
| const bool temp_metadata = | ||
| URI(abspath) | ||
| .parent_path() | ||
| .remove_trailing_slash() | ||
| .to_string() | ||
| .ends_with(constants::array_metadata_dir_name) && | ||
| abspath.ends_with(constants::temp_file_suffix); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The VFS is not an appropriate place to do such filtering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (uri.is_file()) { | ||
| // For LocalFS use a .tmp extension until write is flushed. | ||
| URI metadata_uri(get_uri(uri).to_string() + ".tmp"); | ||
| GenericTileIO::store_data(resources, metadata_uri, tile, encryption_key); | ||
| resources.vfs().move_file(metadata_uri, get_uri(uri)); | ||
| } else { | ||
| // Create a metadata file name | ||
| GenericTileIO::store_data(resources, get_uri(uri), tile, encryption_key); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not only metadata is susceptible to these race conditions; I can think of group members, array schema evolution, and array creation. It's OK if we want to prioritize metadata due to the increased impact.
I would have preferred to move this temporary file logic to the VFS, which would have required a pretty significant refactoring, but doing it this way is not very disruptive and OK for now.
tiledb/sm/filesystem/vfs.cc
Outdated
| stats_->add_counter("ls_num", 1); | ||
|
|
||
| for (auto& fs : ls_with_sizes(parent)) { | ||
| for (auto& fs : ls_with_sizes(parent, false)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do this? It looks like a potential optimization (on POSIX only) that is unrelated to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to prevent us from reopening the file to determine the file size, since the writing thread will move the file when it's done flushing the metadata to disk causing a race condition when opening the array. This was the plan we discussed during sync this morning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that's what it was about. Another option is to either report zero size or skip from listing, if the second open fails. Having to change all VFSes just for POSIX is IMO too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this to the LocalFilesystem base class so it only impacts windows / posix. We can do the try/catch pattern if you prefer though, let me know what you think.
| if (parent.is_file()) { | ||
| parallel_sort( | ||
| compute_tp_, | ||
| entries.begin(), | ||
| entries.end(), | ||
| [](const directory_entry& l, const directory_entry& r) { | ||
| return l.path().native() < r.path().native(); | ||
| }); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pre-existing, but curious; is there code in this repo that assumes that paths are sorted? Sorting paths when listing is not a good idea in general, and precludes incrementally reading them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only code expecting sorted results that I'm aware of is test code. For example here's the CI run with the failures that this produced when I accidentally bypassed sorting in c2e1816
https://github.com/TileDB-Inc/TileDB/actions/runs/19717674222/job/56493579308#step:14:32027
tiledb/sm/filesystem/local.cc
Outdated
| // Noop if the parent ` path ` is not a directory, do not error out. | ||
| if (!is_dir(URI(path))) { | ||
| return Status::Ok(); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be in ls_with_sizes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, thanks! Fixed
This hits the retry loop sometimes, but needs to be more reliable.
VFS::ls is non-virtual and its signature varies across filesystems. This also allows us to use a single implementation instead of duplicating the logic.
This wasn't meant to make it into previous commits.
e56fd7e to
e7c356b
Compare
Path canonicalization on Windows strips disallowed characters from the filename, in this case `.` was stripped causing the failure. https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions
…adata-readerwriter-race-on-local-file-system-vfs
This fixes a race condition when writing and reading metadata on separate threads that is observed only on local filesystems. As @teo-tsirpanis suggested in the story description of CORE-73, we temporarily use a
.tmpextension when flushing array metadata to disk until the write is completed.TYPE: BUG
DESC: Fix write / read metadata race on local filesystem.