Skip to content

Improve filesystem timestamp granularity check #1899

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 20, 2025

Conversation

EliahKagan
Copy link
Member

@EliahKagan EliahKagan commented Mar 20, 2025

Fixes #1896

This avoids false positives for the check of whether we can proceed to run the filesystem snapshot journey test, by making a hundred files about as fast as we can using ordinary non-parallelized techniques, then checking if all their timestamps are distinct.

This carries a small risk of false negatives, but experiments on macOS, where the journey test is typically able to run, suggest that the rate of false negatives will be low. (False positives are also, in theory, possible, but should be extremely rare or never happen.)

Distinct timestamp experiment

First, I experimented outside of CI on Ubuntu, Windows, and macOS, initially with the code shown in #1897 (review), but after that with this test program:

use std::io::Result;
use std::path::Path;
use std::time::{Instant, SystemTime};

fn main() -> Result<()> {
    let reps = 50;
    let n = 100;

    let mut uniques = Vec::new();
    for _ in 0..reps {
        uniques.push(run_experiment(n)?);
    }
    println!("{:?}", uniques);
    println!("{}, {}", uniques.contains(&1), uniques.contains(&n));
    Ok(())
}

fn run_experiment(n: u32) -> Result<u32> {
    let start = Instant::now();
    let tmp = tempfile::tempdir().unwrap();
    let times = unique_times(n, tmp.path())?;
    let finish = Instant::now();
    println!("{} / {}", times.len(), n);
    println!("{times:?}");
    println!("Experiment took {:?}.", finish - start);
    Ok(times.len().try_into().expect("it's out of n, which is u32"))
}

fn unique_times(n: u32, root: &Path) -> Result<Vec<SystemTime>> {
    let names: Vec<_> = (0..n).map(|i| format!("{i:03}")).collect();
    for name in &names {
        std::fs::write(root.join(name), name)?;
    }
    let mut times = Vec::new();
    for name in names {
        times.push(root.join(name).symlink_metadata()?.modified()?);
    }
    times.sort();
    times.dedup();
    Ok(times)
}

Running that locally, the last two lines of output on Ubuntu were:

[4, 4, 4, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 4, 4, 5, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4, 3]
false, false

And on Windows (which has 100 ns representation granularity, and which uses a temporary directory that is not on a separate tmpfs-like filesystem):

[25, 25, 25, 24, 24, 25, 26, 25, 24, 24, 25, 26, 26, 24, 26, 25, 25, 24, 24, 23, 25, 25, 24, 23, 23, 24, 24, 26, 28, 26, 24, 23, 24, 23, 23, 23, 24, 23, 24, 25, 23, 23, 24, 24, 24, 30, 25, 26, 24, 23]                                        false, false

And on macOS, which has 1 ns granularity at least on the current default filesystem:

[100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]                                                                                                              false, true

For full details, including full output, see this repository.

This is not guaranteed to be that way on all such systems. The use of a different filesystem type or device type could potentially confer high granularity on a system that ordinarily doesn't have it, or result in diminished granularity on a system where it ordinarily is high. If it were guaranteed, then actually checking this would be overkill because it would be sufficient to enable the test on macOS but not other operating systems. That is expected to be what happens on CI as it is currently set up, as well as in most local environments, but not necessarily in all local environments. Therefore:

  • I think this is better than assuming which platforms work, though it might make sense to have CI fail if the check returns false on macOS.

  • Before considering this ready, I'm going to do that at least temporarily, to check that my non-CI macOS experiments are similar to CI (i.e. to make sure the test it least sometimes actually running). That's why this is still a draft.

    Edit: I've done that. The passing test with this macOS+CI specific assertion when repeating the whole test 250 times can be seen in this run.

  • Although I've marked this as fixing the issue, it might be better--even if merging this--to view it as a workaround in the hope of developing a better fix later, more along the lines of what you suggested in #1897 (comment). That way, the test could run on all platforms we test on CI, and most other platforms anyone would want to run it on, too.

    On the other hand, even then, it might be considered valuable to avoid sleeping during the test if the granularity is high enough that it isn't necessary. (I don't mean to avoid the delay, but rather with the idea that the test might be slightly more robust.) Also, the technique used here should be possible to adapt to infer the timestamp granularity.

CI experiment

The actual fix here is in d4a8d79, but I also changed the test so it runs 250 times instead of once (including separate calls to the helper function each time) before that commit, and changed it back afterwards. This was to make sure that the reason I didn't see failures was that this really does eliminate the failures (or at least make them extremely rare), rather than by chance, since the failures did not always happen before.

This workflow output on my fork shows that the tests passed even before it was changed back from running 250 times to one time. Please note also that this 250-times repetition is of the whole test--it is not related to the "inner" repetition of creating 100 files.

This is to smoke out nondeterministic failures on any platforms
that have them, so as to gain greater confidence that the
forthcoming fix is effective.

The two major matrix job definitions in `ci.yml` are accordingly
made so a failing job does not cancel the other jobs, so all jobs
that would fail can get as far as to show their failures. This is
likewise temporary.
This avoids false positives for the check of whether we can proceed
to run the filesystem snapshot `journey` test, by making a hundred
files about as fast as we can using ordinary non-parallelized
techniques, then checking if all their timestamps are distinct.

This carries a small risk of false negatives, but experiments on
macOS, where the `journey` test is typically able to run, suggest
that the rate of false negatives will be low. (False positive are
also, in theory, possible, but should be extremely rare or never
happen.)
This raises the risk of a failure when there is no bug, but
hopefully only slightly, and it's a simple way to verify that the
snapshot `journey` test really is running fully in at least one
regularlly tested platform.
@EliahKagan EliahKagan marked this pull request as ready for review March 20, 2025 00:23
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a million, this is great!

I have a few nits with one suggestion related to performance that may be worth more consideration.
Besides that, please feel free to merge when you think it's good to go.

- Decrease n=100 to n=50, which remains decisively enough on all
  platforms being tested (including Windows).

- Slightly streamline iteration by cloning the first iterator so
  its values need not be collected into a `Vec`.

- Make it so that if the CI+macOS high-granularity assertion ever
  fails, the message explains the significance of the failing
  `assert_eq!` comparison. (This condenses an added comment and
  makes it an assertion message instead.)

Suggested-by: Sebastian Thiel <[email protected]>
@EliahKagan
Copy link
Member Author

EliahKagan commented Mar 20, 2025

I'm waiting on tests in https://github.com/EliahKagan/gitoxide/actions/runs/13960418689, which verify that the technique still works under the changes (via outer repetition of the whole test, something we don't want to do at the tip of main and which is thus reverted in the next commit). Assuming that still goes well, I'll merge or enable auto-merge.

Edit: 🚀

@EliahKagan EliahKagan enabled auto-merge March 20, 2025 02:23
@EliahKagan EliahKagan merged commit 091d994 into GitoxideLabs:main Mar 20, 2025
21 checks passed
@EliahKagan EliahKagan deleted the run-ci/granularity branch March 20, 2025 02:26
@Byron
Copy link
Member

Byron commented Mar 20, 2025

Thanks again for all your help with this! It's huge issue that I introduced and couldn't easily fix, but you resolved it in the shortest possible time 🙏.

@EliahKagan
Copy link
Member Author

Thanks! I hope this will not dissuade you from replacing the approach added here with something that would let the test run everywhere, if that can be done without making it work any less well where it already can run. Or from encouraging me to do so, if you have thoughts about how, beyond (or in more detail on) the idea in #1897 (comment).

@Byron
Copy link
Member

Byron commented Mar 20, 2025

The test was added only because I strangely was missing one, but younger me might have predicted the difficulty in doing so and skipped it. Now that we went through the trouble, I am happy there is a test running at all, even though it won't do so right now on all platforms.
Thus, for now I think it's good enough and a clear improvement to what was (or wasn't) there before, so it's unlikely I get to work in it anytime soon.

Here is my gitoxide backlog, and that's just what I started but didn't finish. There is more bugs hidden in issues that I have to fix with priority.
Screenshot 2025-03-20 at 12 53 20

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FileSnapshot::into_owned_or_cloned intermittently fails according to its journey test
2 participants