-
Notifications
You must be signed in to change notification settings - Fork 80
Commit cb9b2d0
pipe: fix SSH hang (again)
When cbfaeba (Cygwin: pipe: Fix incorrect write length in
raw_write(), 2024-11-06) fixed a bug where too-long writes could cause
segmentation faults, it not only left out crucial details from the
commit message, it also introduced a bug.
This manifests e.g. in the symptom where cloning large repositories in
Git for Windows via SSH "freeze" at random stages, as has been reported
in git-for-windows/git#5688 and in
git-for-windows/git#5682.
The bug in question was to use `is_nonblocking ()` instead of
`real_non_blocking_mode` in a newly introduced condition. If the commit
message had filled in those details, it would most likely have been much
easier to spot the bug before the patch was committed.
Granted, the patch _sort of_ moved this `is_nonblocking ()` condition
from another place, which means that the bug was present before, even if
it did not have the same detrimental symptom of hanging a clone of a
large repository via SSH.
The _original_ bug was introduced in the equally under-explained
7ed9adb (Cygwin: pipe: Switch pipe mode to blocking mode by default,
2024-09-05).
What is ironic is that this here patch is the latest (and hopefully
last) commit in a _long_ chain of bug fixes that fix bugs introduced by
preceding bug fixes:
- 9e4d308 (Cygwin: pipe: Adopt FILE_SYNCHRONOUS_IO_NONALERT flag for
read pipe., 2021-11-10) fixed a bug where Cygwin hung by mistake while
piping output from one .NET program as input to another .NET program
(potentially introduced by 3651990 (Cygwin: pipe: Avoid false EOF
while reading output of C# programs., 2021-11-07), which was itself a
bug fix). It introduced a bug that was fixed by...
- fc691d0 (Cygwin: pipe: Make sure to set read pipe non-blocking for
cygwin apps., 2024-03-11). Which introduced a bug that was purportedly
fixed by...
- 7ed9adb (Cygwin: pipe: Switch pipe mode to blocking mode by
default, 2024-09-05). Which introduced a bug that was fixed by...
- cbfaeba (Cygwin: pipe: Fix incorrect write length in raw_write(),
2024-11-06). Which introduced a bug that was fixed by... this here
patch.
There is not only the common thread here that each of these bug fixes
introduced a new bug, but also the common thread of commit messages that
leave the reader puzzled and the problem — as well as the solution —
under-explained. It is quite likely that, had the time and care been
spent to fully explain what is going on in the commit messages, this
would quite likely have triggered a re-review of the diff by the author.
In other words, writing a complete commit message could have increased
the insight and understanding of the problem and thereby prevented this
series of "fixes that introduce new bugs".
To avoid making the same mistake once again, here is my attempt at a
thorough explanation of what's going on, what is the problem, what
solutions were considered, and why the "winner" was chosen.
So let's start with an overview of the logic of the
`fhandler_pipe_fifo::raw_write ()` method.
This method implements the fundamental logic behind writing to any pipe
in Cygwin, whether it be blocking or not blocking, and takes as input
two variables, `ptr` and `len`, pointing to the data to be written.
First, this method determines a couple of initial conditions (such as
maybe waiting on a specific mutex, or determining whether the pipe had
been closed already, or whether it is in blocking or in non-blocking
mode). Then, this method determines how many bytes of data should be
written in each attempt (assigned to the variable `chunk`).
The amount of bytes already written is tracked as `nbytes`. While this
value is smaller than `len`, the central loop of this method is
executed.
Counterintuitively, this loop is necessary even in non-blocking mode, to
handle conditions like `STATUS_PENDING`.
Inside this loop body, a variety of conditions are handled, including
some that break out of that loop before `nbytes >= len`, e.g. when the
pipe is closed on the other end.
This loop body also contains special, complicated logic to handle the
case where the pipe buffer is not large enough to handle the current
`chunk`, trying to write even less bytes (this is guarded by the flag
`short_write_once`).
Unfortunately, the recent re-design of that logic decided to always set
the pipe to blocking (cf. the diff of ed9adb356 (Cygwin: pipe: Switch
pipe mode to blocking mode by default, 2024-09-05)). The rationale for
that decision was the desire to simplify the code, which came at the
expense of working well together with pipes that were created outside of
Cygwin and that are set to non-blocking mode.
To this end, the current implementation now _emulates_ non-blocking
behavior while keeping the actual pipe in blocking mode. This requires
knowing how much data can be written to the pipe without blocking at any
given moment. Because by knowing that amount, even in blocking mode a
write operation can be performed that does not block. This information
is obtained via `NtQueryInformationFile(FilePipeLocalInformation)` and
stored in the variable `avail`.
However, if for some reason it cannot be determined how much space is
available in the pipe, this non-blocking mode emulation would not work.
Therefore the pipe is set to genuine non-blocking mode in such cases,
and only in such cases. This minimizes the time window (but does not
eliminate it completely!) during which Cygwin does not work well
together with non-Cygwin processes via pipes.
It also shows that the original desire to simplify the code can not be
fulfilled using the current strategy, as there is now both code to
emulate non-blocking mode as well as code that handles genuinely
non-blocking mode.
This information whether non-blocking mode is merely emulated or
genuinely enabled, is tracked by the Boolean variable
`real_non_blocking_mode`.
At the beginning of the loop body, after determining how many bytes to
write (stored in `len1`), the `NtWriteFile()` is called. Since the
re-design to always use blocking mode (at least insofar possible), this
`NtWriteFile()` call is guarded by a condition so that it is called only
once in non-blocking mode.
Now, the logic for that is quite complex. First, `len1` is clamped to
`chunk` in blocking mode, otherwise to however many bytes are left to
write. Then, an _inner_ `while` loop runs as long as `len1` is greater
than 0. Inside that loop, `NtWriteFile()` is called _unless_ we'd try to
write more bytes than are available in emulated non-blocking mode.
It is the complex interplay between that clamping to `chunk` and the
condition when `NtWriteFile()` is skipped that is at the heart of this
bug.
But let's first see what happens in the case described above, where the
(Cygwin) SSH process gets stuck writing to the (non-Cygwin) Git process.
In the instances I observed, the big `while (nbytes < len)` loop is
started with the following initial conditions: `len` is 2097152, `chunk`
and `avail` are 1 and `real_non_blocking_mode` is 0.
When the outer loop is entered, `len1` is therefore also set to 2097152,
and at the start of the inner loop the condition `len1 > avail` holds
true. Therefore, the `NtWriteFile()` call is skipped already during the
first iteration, we run into the code that tries a short write (setting
`short_write_once` to 1), and ultimately fail and return from
`raw_write()` with an error.
Since not a single byte was written, the caller will try again at some
stage, failing again, over and over, and we're in an infinite loop.
The fix chosen in this here commit is to apply that clamping to `chunk`
_also_ in the emulated non-blocking mode. This lets the `NtWriteFile()`
call _not_ be skipped in non-emulated non-blocking mode, as was clearly
the intention.
An alternative patch that would have also addressed the symptom would be
to partially revert this hunk of cbfaeba (Cygwin: pipe: Fix incorrect
write length in raw_write(), 2024-11-06):
chunk = len;
- else if (is_nonblocking ())
- chunk = len = pipe_buf_size;
else
chunk = avail;
The logic changed such that `len` would no longer be clamped to the same
value as `chunk` in non-blocking mode. Reinstating that clamp in
non-blocking mode would also work around the issue, but it would be a
bit more heavy-handed than the chosen fix.
Yet another potential fix that was suggested (in
git-for-windows/git#5688 (comment))
would be to modify the condition when `NtWriteFile()` is skipped so as
to force it to _not_ be skipped when attempting a "short write", i.e.
when `short_write_once != 0`. This fix would have worked around the bug
successfully, but is fixing the bug at the wrong layer (even if it hints
at the fact that the `short_write_once` logic is ill-prepared for
non-blocking mode, but that's a topic for another patch).
Fixes: cbfaeba (Cygwin: pipe: Fix incorrect write length in raw_write(), 2024-11-06)
Co-authored-by: Takashi Yano <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>1 parent d973094 commit cb9b2d0Copy full SHA for cb9b2d0
File tree
Expand file treeCollapse file tree
1 file changed
+1
-1
lines changedFilter options
- winsup/cygwin/fhandler
Expand file treeCollapse file tree
1 file changed
+1
-1
lines changedwinsup/cygwin/fhandler/pipe.cc
Copy file name to clipboardExpand all lines: winsup/cygwin/fhandler/pipe.cc+1-1Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change | |
---|---|---|---|
| |||
561 | 561 |
| |
562 | 562 |
| |
563 | 563 |
| |
564 |
| - | |
| 564 | + | |
565 | 565 |
| |
566 | 566 |
| |
567 | 567 |
| |
|
0 commit comments