How well are `arrow` builds synchronized across test processes? #1839

EliahKagan · 2025-02-12T21:29:41Z

EliahKagan
Feb 12, 2025
Collaborator

I got a failure in `overwriting_files_and_lone_directories_works` on CI when fast-forwarding the main branch of my fork. It occurred in the `test-fast` `macos-latest` job. The failure appears intermittent and rare, both because it went away when the job was rerun and because it seems to arise from a strange race condition.

        FAIL [   1.057s] gix-worktree-state-tests::worktree state::checkout::overwriting_files_and_lone_directories_works
──── STDOUT:             gix-worktree-state-tests::worktree state::checkout::overwriting_files_and_lone_directories_works

running 1 test
test state::checkout::overwriting_files_and_lone_directories_works ... FAILED

failures:

failures:
    state::checkout::overwriting_files_and_lone_directories_works

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 14 filtered out; finished in 1.04s

──── STDERR:             gix-worktree-state-tests::worktree state::checkout::overwriting_files_and_lone_directories_works
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on build directory
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.30s
--: /Users/runner/work/gitoxide/gitoxide/target/debug/examples/arrow: No such file or directory
Error: Filter(Driver(Init(ProcessHandshake { source: Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }), command: "/bin/sh" "-c" "/Users/runner/work/gitoxide/gitoxide/target/debug/examples/arrow process" "--" })))

I believe this is completely unrelated to #1373, even though the failing test case is the same. There, the problem was in the symlink probe. Here, the problem is in calling the `arrow.rs` example filter, which does not exist when called. The messages about the build blocking are normal, and often happen, at least when the test suite is run locally and I believe also on CI. But it looks like the multiple concurrent runs of `cargo build -p=gix-filter --example arrow` somehow result in the `arrow` executable temporarily being absent when called. When an attempt is made to run it, it's not there:

/Users/runner/work/gitoxide/gitoxide/target/debug/examples/arrow: No such file or directory

I'm not clear on how that would happen, though. I'm not sure if this is a bug in the test suite, or a bug in cargo or rustc, or some other condition.

`overwriting_files_and_lone_directories_works` is one of three tests that call `setup_filter_pipeline`, which calls `driver_exe`, which accesses `DRIVER`, which is a `once_cell::sync::Lazy` instance that runs `cargo build -p=gix-filter --example arrow` at most once per process.
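To make the "at most once per process" behavior concrete, here is a minimal sketch of that pattern. It is not the actual test-suite code: it swaps in `std::sync::OnceLock` for `once_cell::sync::Lazy` so it needs no external crate, and `build_arrow` and `BUILD_RUNS` are hypothetical stand-ins that only count invocations instead of running cargo.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical counter, used only to observe how often the build step runs.
static BUILD_RUNS: AtomicUsize = AtomicUsize::new(0);

/// Process-wide cache of the driver path (stand-in for the `DRIVER` static).
static DRIVER: OnceLock<PathBuf> = OnceLock::new();

/// Stand-in for the build step. The real one would run something like
/// `cargo build -p=gix-filter --example arrow`; this one only counts calls.
fn build_arrow() -> PathBuf {
    BUILD_RUNS.fetch_add(1, Ordering::SeqCst);
    PathBuf::from("target/debug/examples/arrow")
}

/// Returns the cached path, running the build step on first use only.
fn driver_exe() -> &'static PathBuf {
    DRIVER.get_or_init(build_arrow)
}
```

Threads in the same process share `DRIVER`, so the build step runs once no matter how many tests call `driver_exe`. Separate test processes each get their own copy of the static, so nothing here coordinates them.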

Although this may help avoid unnecessary runs of that build command in some circumstances, it neither prevents nor typically decreases the likelihood of running that build command two or more times concurrently. This is because gitoxide primarily uses the nextest runner, which uses multiple test processes to run separate tests in parallel. But it seems to me that this shouldn't be a problem, because cargo does its own synchronization. This can be seen in:

    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on build directory
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.30s

So I don't understand why `arrow` is not there when an attempt is made to run it as a filter.

This does not seem to be due to problems with the directory from which the command is run, nor to any wrong assumptions about where the executable should be created. If it were, then this test would always or at least regularly fail, instead of hardly ever failing. In addition, I've verified locally (though on GNU/Linux, not macOS) that the `arrow` executable is in that location when built from the root of the workspace as well as from `gix-worktree-state/tests`.

Byron · 2025-02-13T06:35:10Z

Byron
Feb 13, 2025
Maintainer

Even though it's unclear how it can happen, I think it's clear that it must be cargo causing the temporary disappearance of the `arrow` executable. No test code meddles with it, from what I can tell.

My guess is that cargo processes that wait for the lock perform certain work on the binary despite no rebuild being necessary. What if it decides to relink it, and the linker has strange behaviour around unlinking the target binary first instead of moving it into place? Maybe cargo itself is responsible for this, but in theory no such thing could happen if even a newly built or linked binary were moved into position.
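For illustration, "moving into position" usually means writing the new file under a temporary name and then renaming it over the target. The sketch below shows that general technique only; it makes no claim about what cargo or any linker actually does, and `replace_atomically` and the `.tmp` suffix are hypothetical names.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Replaces `target` by writing a sibling temporary file and renaming it
/// into place. On POSIX systems the rename swaps the directory entry in one
/// step, so readers see either the old file or the new one, never a gap.
fn replace_atomically(target: &Path, contents: &[u8]) -> io::Result<()> {
    // Hypothetical naming scheme: `arrow` becomes `arrow.tmp`.
    let tmp = target.with_extension("tmp");
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents)?;
        // Flush to disk before the rename makes the file visible.
        file.sync_all()?;
    }
    fs::rename(&tmp, target)
}
```

By contrast, a sequence that first unlinks the target and then creates it again leaves a window in which the path does not exist, which would match the "No such file or directory" symptom.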

If this keeps happening or happens often enough to be annoying, I suppose the test-suite could use its own lock to prevent multiple concurrent cargo invocations even across binaries.
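A rough sketch of what such a test-suite-level lock could look like, using only the standard library. It relies on exclusive file creation (`create_new`) as the cross-process primitive. `BuildLock`, the polling interval, and the timeout are assumptions made for illustration, not existing test-suite code.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical guard: the lock is held for as long as the marker file exists.
struct BuildLock {
    path: PathBuf,
}

impl BuildLock {
    /// Polls until the marker file can be created exclusively or `timeout` passes.
    fn acquire(path: &Path, timeout: Duration) -> io::Result<BuildLock> {
        let deadline = Instant::now() + timeout;
        loop {
            // `create_new` fails with `AlreadyExists` if another holder got there first.
            match OpenOptions::new().write(true).create_new(true).open(path) {
                Ok(_) => return Ok(BuildLock { path: path.to_path_buf() }),
                Err(e) if e.kind() == io::ErrorKind::AlreadyExists => {
                    if Instant::now() >= deadline {
                        return Err(io::Error::new(
                            io::ErrorKind::TimedOut,
                            "timed out waiting for build lock",
                        ));
                    }
                    thread::sleep(Duration::from_millis(20));
                }
                Err(e) => return Err(e),
            }
        }
    }
}

impl Drop for BuildLock {
    /// Releases the lock by removing the marker file.
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.path);
    }
}
```

A test process would hold the guard around its cargo invocation and drop it afterwards. One known weakness of this simple scheme is that a process killed while holding the lock leaves a stale marker file behind, so a real implementation would need some form of stale-lock handling.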

On top of that, or alternatively, I could imagine that it should be quite straightforward to write a test-program that ramps up the parallelism to reliably trigger this issue. Then it could be posted on the cargo issue tracker for a chance to be fixed.
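As a starting point, such a stress program might be shaped like the sketch below. It runs a build step from many threads at once and counts how often the expected artifact is missing right after a build reports success. The function names, worker and round counts, and the use of an existence check as a proxy for "can be executed" are all assumptions; whether this reproduces the failure is untested.

```rust
use std::path::Path;
use std::process::Command;
use std::thread;

/// Runs `build` on `workers` threads, `rounds` times each, and returns how
/// often `artifact` did not exist immediately after a successful build.
fn count_missing_after_build<F>(workers: usize, rounds: usize, artifact: &Path, build: F) -> usize
where
    F: Fn() -> bool + Sync,
{
    thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|_| {
                s.spawn(|| {
                    let mut missing = 0usize;
                    for _ in 0..rounds {
                        if build() && !artifact.exists() {
                            missing += 1;
                        }
                    }
                    missing
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

/// Build step for a real run: each call spawns a separate cargo process,
/// using the same command the test suite runs.
fn cargo_build_arrow() -> bool {
    Command::new("cargo")
        .args(["build", "-p=gix-filter", "--example", "arrow"])
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}
```

From the workspace root, a call such as `count_missing_after_build(16, 50, Path::new("target/debug/examples/arrow"), cargo_build_arrow)` would then report the number of observed gaps. Any nonzero result would be worth attaching to a cargo issue.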

CC @weihanglo in case they have thoughts.

6 replies

weihanglo Feb 14, 2025

Like, if one process was about to run the executable, and the other tried to modify/delete it. Could it be that it was just too fast and the copy-on-write had not yet been done, so nothing was found?

Byron Feb 15, 2025
Maintainer

The paragraph below the line is without substance, but was left so any of what follows makes sense.

Thanks for chiming in! I think the issue also occurs on Linux, where I'd think the replacement is done by a simple rename call.
With that said, I think there might be two issues stacked on top of each other:

1. Cargo thinks the binary has to be remade even though it was definitely just created a moment ago
2. Somehow this binary replacement is racy at least on Linux

If either of them wasn't present, I think it would work as expected.

EliahKagan Feb 15, 2025
Collaborator Author

Does it also happen on Linux? The failure I observed, described above, was in a macos-latest run of test-fast.

Byron Feb 16, 2025
Maintainer

Apologies, I completely messed this up. The discussion is about macOS runners only, so it does appear to be a macOS-specific problem.

EliahKagan Feb 16, 2025
Collaborator Author

This happens rarely and I've possibly only observed it once. So I don't know if it's macOS-specific, or just that the probability of it happening is low. But the observed case is on macOS.

For some reason, this reminds me of the macOS-specific issue gitpython-developers/gitdb#115, even though that only happened with Docker and it was a "permission denied" error. The similarity is that they are both race conditions that have possible explanations in terms of the macOS implementation of an operation being surprisingly non-atomic. But the underlying operations seem to be different, so maybe there is no useful connection.
