Skip to content

Fix check-mode.sh display of leading-whitespace paths #1709

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 28, 2024

Conversation

EliahKagan
Copy link
Member

@EliahKagan EliahKagan commented Nov 28, 2024

In #1708, check-mode.sh deliberately parsed git ls-files -sz records using default nonempty IFS, to conveniently read fields into variables. But this would strip leading and trailing whitespace from the paths. Trailing whitespace is not currently present, since only files with .sh suffixes are processed, so the only paths that would be displayed incorrectly were those with leading whitespace.

(We do not intentionally have such files in the repository, but they can arise accidentally. They might also, in principle, be added at some point to support tests. They would not likely be kept, because they are incompatible with Windowws and git will not check them out there, but they could plausibly exist in some commits on a feature branch, even with no mistakes.1)

This was sort of okay, because the script never accesses the repository or filesystem using the paths, but only displays them to the user. But this was still bad for a few reasons:

  • The paths were at least in principle ambiguous.
  • A path with accidental leading whitespace would not be noticed.
  • The path variable looks like it would be usable, so it could be easily misused in future changes, if not fixed.
  • Having to think through whether or not this parsing bug is excessively serious, when reasoning about the code, may make the code more cumbersome to understand and maintain.

This is to say that I should never have done it in that way.

Therefore, this replaces the default-IFS logic with empty-IFS calls to read and regular-expression processing, using bash's [[ keyword, which supports a =~ operator that matches against a regular expression and populates the BASH_REMATCH array variable with captures.

(This approach to parsing git ls-file -sz is based on code I wrote in https://github.com/EliahKagan/gitscripts/blob/master/git-lsx.)

As discussed in #1708, this script may be converted to Rust. This change is not intended as a substitute for that. It should actually make that slightly easier, in that a regex approach has clearer semantics, whether it is translated into code that uses regex or not. (If regex are used, it will probably still need adjustment, as not all dialects recognize POSIX classes like [[:digit:]].)


I have tested this on Arch Linux, macOS, and Windows [edit: and again after 530948d], though I have not actually tested paths with leading whitespace on Windows, since they are not supported there and git will not check them out. (I have checked that it still works otherwise on all systems, though.)

For macOS, in addition to that verification, I've checked a bash 3.2 manual to verify that =~ and $BASH_REMATCH are present in that version, with the semantics I expect. [Edit: While =~ worked the way I expected, [[ itself didn't, in its interaction with set -e. That's fixed in 530948d.]

1 I mean, in addition to this feature branch, which does have such files in 2ba1ce3, removed in a9b3b48. (If you prefer to avoid that--maybe it would get in the way of some uses of git bisect on Windows?--then the twothree commits in this PR could be squashed into one or two.)

In GitoxideLabs#1708, `check-mode.sh` deliberately parsed `git ls-files -sz`
records using default nonempty `IFS`, to conveniently read fields
into variables. But this would strip leading and trailing
whitespace from the paths. Trailing whitespace is not currently
present, since only files with `.sh` suffixes are processed, so the
only paths that would be displayed incorrectly were those with
leading whitespace.

(We do not intentionally have such files in the repository, but
they can arise accidentally. They might also, in principle, be
added at some point to support tests. They would not likely be
kept, because they are incompatible with Windowws and `git` will
not check them out there, but they could plausibly exist in some
commits on a feature branch, even with no mistakes.)

This was *sort of* okay, because the script never accesses the
repository or filesystem using the paths, but only displays them to
the user. But this was still bad for a few reasons:

- The paths were at least in principle ambiguous.

- A path with accidental leading whitespace would not be noticed.

- The `path` variable looks like it would be usable, so it could be
  easily misused in future changes, if not fixed.

- Having to think through whether or not this parsing bug is
  excessively serious, when reasoning about the code, may make the
  code more cumbersome to understand and maintain.

This is to say that I should never have done it in that way.

Therefore, this replaces the default-`IFS` logic with empty-`IFS`
calls to `read` and regular-expression processing, using `bash`'s
`[[` keyword, which supports a `=~` operator that matches against a
regular expression and populates the `BASH_REMATCH` array variable
with captures.

(This approach to parsing `git ls-file -sz` is based on code I wrote
in https://github.com/EliahKagan/gitscripts/blob/master/git-lsx.)

As discussed in GitoxideLabs#1708, this script may be converted to Rust. This
change is not intended as a substitute for that. It should actually
make that slightly easier, in that a regex approach has clearer
semantics, whether it is translated into code that uses regex or
not. (If regex are used, it will probably still need adjustment,
as not all dialects recognize POSIX classes like `[[:digit:]]`.)

This commit includes some files to test the fix in the script. They
are at the top level of the repository, since the bug this fixes
only occurs with whitespace at the beginning of the first path
component (relative to the root of the working tree). These files
will be removed shortly. The messages for them are:

    mode +x but no shebang: $'\tb.sh'
    mode -x but has shebang: \ a.sh

Before the fix, they would have been shown as `a.sh` instead of
`\ a.sh`, and `b.sh` instead of `$'\tb.sh'`. Note that the approach
to quoting has not itself changed: the `%q` format specifier for
the `printf` builtin in `bash` formats them this, when the paths
are parsed in such a way that the characters that would need
quoting if used in the shell are not lost.
Now that the `etc/check-mode.sh` script is verified to detect them.
@EliahKagan EliahKagan marked this pull request as ready for review November 28, 2024 15:15
@EliahKagan EliahKagan marked this pull request as draft November 28, 2024 16:03
In bash 3.2, which is used on macOS, `[[` commands don't invoke
`set -e` behavior. This adds an explicit check.

This also improves some code that sliced an array instead of using
the variables its elements were already assigned to. One of them is
used twice, and indexing in the existing usage would make the
meaning less clear. So the variables would not be reasonable to
eliminate by slicing the array, without other refactoring.
@EliahKagan EliahKagan marked this pull request as ready for review November 28, 2024 16:20
@Byron
Copy link
Member

Byron commented Nov 28, 2024

Thanks a lot, and as always, I am impressed by the amount of diligence that went into this PR.

And since this PR was necessary even for somebody whom I consider super-human, this should probably provide all the motivation that's needed to rewrite this script as part of internal-tools sooner than later.

Doing so would have one disadvantage though: build times. But maybe with a shared-key for the cache one can share just the right caches across all jobs that use the internal tools, helping them start relatively quickly, albeit still slowly compared to a shell script. But ultimately, it should be worth it (and fun), even for a script that's probably never touched again.

@Byron Byron merged commit ee33221 into GitoxideLabs:main Nov 28, 2024
21 checks passed
@Byron
Copy link
Member

Byron commented Nov 28, 2024

PS: I am aware that a potential mine was just merged, with two files that seem to be displayed incorrectly in the GitHub UI even, without any quoting or hints about the whitespace they have in their name. But I suppose maybe one day it will help to uncover a shortcoming in a tool, which makes it worth it.

@EliahKagan
Copy link
Member Author

It will be interesting to see how it displays in the web-based interfaces of the Codeberg and gitee mirrors next time they are fast-forwarded. I am not very worried, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants