tests: fix race condition in start_conmon_with_default_args#648
tests: fix race condition in start_conmon_with_default_args#648jnovy wants to merge 1 commit intocontainers:mainfrom
Conversation
|
Ephemeral COPR build failed. @containers/packit-build please check. |
1 similar comment
|
Ephemeral COPR build failed. @containers/packit-build please check. |
|
@ricardobranco777 PTAL |
Thanks for looking into this. It passed on x86_64: https://openqa.opensuse.org/tests/5751790 On aarch64 we have this flake: |
When backporting #645 and this patch we get this on aarch64: http://openqa-assets.opensuse.org/tests/5751813/file/conmon-conmon-runc-user.tap.txt
Also: |
|
I gave up on #645 This PR doesn't seem to solve the issue on aarch64 as I'm getting these on http://openqa-assets.opensuse.org/tests/5751931/file/conmon-conmon-runc-root.tap.txt
|
Fix two race conditions in start_conmon_with_default_args(): 1. The previous fix checked for "running" or "stopped" state immediately after conmon returned, to handle --exec on already-running containers. However, on slow systems (e.g. aarch64), runc state can return a bogus "stopped" state (pid=0, created=epoch) during container creation, causing the function to return early without ever calling runc start. Fix: only skip the create/start flow when --exec is in the arguments, since that is the only case where the container is already managed. 2. On slow systems, there is a race condition where the container init process (created by runc create) may exit between the "created" state check and the runc start call. runc start then fails with "cannot start a container that has stopped". Fix: detect this specific error and treat it as success, since the container did run to completion. Fixes the flaky test failures seen on aarch64 in openQA: - https://openqa-assets.opensuse.org/tests/5751789/file/conmon-conmon-runc-root.tap.txt - http://openqa-assets.opensuse.org/tests/5751813/file/conmon-conmon-runc-root.tap.txt - http://openqa-assets.opensuse.org/tests/5751894/file/conmon-conmon-runc-root.tap.txt Signed-off-by: Jindrich Novy <jnovy@redhat.com>
20e757c to
681a041
Compare
|
@ricardobranco777 Thanks for the detailed test results! I investigated the failures thoroughly and found two distinct race conditions: Race Condition 1 (caused by our original fix): Fix: Only skip the create/start flow when Race Condition 2 (pre-existing, exposed by PR #645): Fix: Detect the specific The updated commit addresses both issues. |
|
@ricardobranco777 Seems we are chasing ghosts here. In fact this one is a Can you retry with updated |
ricardobranco777
left a comment
There was a problem hiding this comment.
@ricardobranco777 Seems we are chasing ghosts here. In fact this one is a
runcregression fixed inrunc1.4.1 (released 2026-03-13) and was backported to 1.3.x branch of runc too - opencontainers/runc#5153Can you retry with updated
runc?
LGTM.
| # stopped). In that case, we should not try to start it again. | ||
| local is_exec=false | ||
| for arg in "${extra_args[@]}"; do | ||
| if [[ "$arg" == "--exec" ]]; then |
There was a problem hiding this comment.
Not to be pedantic, but no double-quotes needed for variables with [[. Also, == assumes a pattern. But this is harmless.
| if [[ "$arg" == "--exec" ]]; then | |
| if [[ $arg = "--exec" ]]; then |
| # and the start call. runc start then fails with "cannot start a | ||
| # container that has stopped". Check if this is the case and treat | ||
| # it as success since the container did run to completion. | ||
| if expr "$start_output" : ".*cannot start a container that has stopped" > /dev/null; then |
There was a problem hiding this comment.
I believe we can replace all uses of expr with Bash's [[ just to avoid spawning new processes.
| if expr "$start_output" : ".*cannot start a container that has stopped" > /dev/null; then | |
| if [[ $start_output =~ ".*cannot start a container that has stopped" ]]; then |
|
If it runs for you with the new |
The above runs were done with this patch and runc v1.4.1. |
FWIW, the tests pass without this patch and runc v1.4.1:
Thanks for helping with #642 |
Fix the race condition in
start_conmon_with_default_args()intest/test_helper.bash.Problem
When
start_conmon_with_default_argsis called a second time with--exec, the original container may have already transitioned tostoppedstate. The function only checked forrunningstate before returning early, so astoppedcontainer would fall through towait_for_runtime_status "$CTR_ID" created— which would never succeed, resulting in a timeout.PR #642 proposed masking this by increasing the timeout from 5s to 30s, but that does not address the root cause.
Fix
Add a check for
stoppedstate alongside the existingrunningcheck, so both states cause an early return. This properly handles the case where the container has already finished by the time the exec conmon instance starts.All 119 tests pass with this fix, including the 7 exec tests, 3 attach tests, and 19 ctrl tests that exercise the affected code path.
Related: #642
Signed-off-by: Jindrich Novy jnovy@redhat.com