Skip to content

Conversation

@shaunrd0
Copy link
Member

This fixes a race condition when writing and reading metadata on separate threads that is observed only on local filesystems. As @teo-tsirpanis suggested in the story description of CORE-73, we temporarily use a .tmp extension when flushing array metadata to disk until the write is completed.


TYPE: BUG
DESC: Fix write / read metadata race on local filesystem.

@shaunrd0 shaunrd0 force-pushed the shaunreed/core-73-metadata-readerwriter-race-on-local-file-system-vfs branch from b885040 to 2b03211 Compare November 21, 2025 22:12

## Metadata File

Metadata files with the `.tmp` extension are used for local filesystems only, indicating the metadata is still being flushed to disk.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what others think of using a .part extension instead. Either way is fine by me so by all means we can stick with .tmp, just something that came to mind while reviewing this myself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.tmp has an unambiguous meaning, but you can add another extension before it if you want.

Also, this is an implementation detail that should not be mentioned in the spec. If we haven't already, we should state that readers should ignore files with names that don't follow the respective format.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I updated the format spec in 523230a (did not make changes to the extension)

uri_.join_path(constants::array_metadata_dir_name), &unfiltered_uris));
for (const auto& unfiltered_uri : unfiltered_uris) {
const auto& uri = unfiltered_uri.to_string();
if (uri.ends_with(constants::temp_file_suffix)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check if the file does not have a timestamped name that ends in .tdb.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 523230a

Comment on lines 378 to 386
// Do not attempt to retrieve file size for temporary metadata files that
// are still flushing to disk.
const bool temp_metadata =
URI(abspath)
.parent_path()
.remove_trailing_slash()
.to_string()
.ends_with(constants::array_metadata_dir_name) &&
abspath.ends_with(constants::temp_file_suffix);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VFS is not an appropriate place to do such filtering.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was removed in 523230a, it's purpose was to avoid opening the file to determine the size that was later thrown out by VFS::ls. 9b2e174 updated VFS::ls instead to avoid this by not gathering file sizes since they aren't needed.

Comment on lines +213 to +221
if (uri.is_file()) {
// For LocalFS use a .tmp extension until write is flushed.
URI metadata_uri(get_uri(uri).to_string() + ".tmp");
GenericTileIO::store_data(resources, metadata_uri, tile, encryption_key);
resources.vfs().move_file(metadata_uri, get_uri(uri));
} else {
// Create a metadata file name
GenericTileIO::store_data(resources, get_uri(uri), tile, encryption_key);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only metadata is susceptible to these race conditions; I can think of group members, array schema evolution, and array creation. It's OK if we want to prioritize metadata due to the increased impact.

I would have preferred to move this temporary file logic to the VFS, which would have required a pretty significant refactoring, but doing it this way is not very disruptive and OK for now.

stats_->add_counter("ls_num", 1);

for (auto& fs : ls_with_sizes(parent)) {
for (auto& fs : ls_with_sizes(parent, false)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do this? It looks like a potential optimization (on POSIX only) that is unrelated to this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to prevent us from reopening the file to determine the file size, since the writing thread will move the file when it's done flushing the metadata to disk causing a race condition when opening the array. This was the plan we discussed during sync this morning?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's what it was about. Another option is to either report zero size or skip from listing, if the second open fails. Having to change all VFSes just for POSIX is IMO too much.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this to the LocalFilesystem base class so it only impacts windows / posix. We can do the try/catch pattern if you prefer though, let me know what you think.

Comment on lines +369 to +377
if (parent.is_file()) {
parallel_sort(
compute_tp_,
entries.begin(),
entries.end(),
[](const directory_entry& l, const directory_entry& r) {
return l.path().native() < r.path().native();
});
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pre-existing, but curious; is there code in this repo that assumes that paths are sorted? Sorting paths when listing is not a good idea in general, and precludes incrementally reading them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only code expecting sorted results that I'm aware of is test code. For example here's the CI run with the failures that this produced when I accidentally bypassed sorting in c2e1816

https://github.com/TileDB-Inc/TileDB/actions/runs/19717674222/job/56493579308#step:14:32027

Comment on lines 129 to 133
// Noop if the parent ` path ` is not a directory, do not error out.
if (!is_dir(URI(path))) {
return Status::Ok();
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in ls_with_sizes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, thanks! Fixed

@shaunrd0 shaunrd0 force-pushed the shaunreed/core-73-metadata-readerwriter-race-on-local-file-system-vfs branch from e56fd7e to e7c356b Compare December 1, 2025 22:15
Path canonicalization on Windows strips disallowed characters from the
filename, in this case `.` was stripped causing the failure.

https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions
…adata-readerwriter-race-on-local-file-system-vfs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants