tests: fix race condition in start_conmon_with_default_args #648

Closed
jnovy wants to merge 1 commit into containers:main from jnovy:fix-race-condition-start-conmon

@jnovy
Collaborator

@jnovy jnovy commented Mar 16, 2026

Fix the race condition in start_conmon_with_default_args() in test/test_helper.bash.

Problem

When start_conmon_with_default_args is called a second time with --exec, the original container may have already transitioned to stopped state. The function only checked for running state before returning early, so a stopped container would fall through to wait_for_runtime_status "$CTR_ID" created — which would never succeed, resulting in a timeout.

PR #642 proposed masking this by increasing the timeout from 5s to 30s, but that does not address the root cause.
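To illustrate why a longer timeout only delays the failure, here is a minimal sketch of a polling wait. The names `get_runtime_state`, `wait_for_state`, and `FAKE_STATE` are hypothetical stand-ins for this example and are not the actual helpers in test/test_helper.bash.

```bash
# Hypothetical stand-in for querying the runtime; FAKE_STATE is illustrative only.
get_runtime_state() {
    echo "${FAKE_STATE:-created}"
}

# Illustrative poll loop, not the real wait_for_runtime_status.
# Usage: wait_for_state WANTED_STATE ATTEMPTS [INTERVAL_SECONDS]
wait_for_state() {
    local want=$1 attempts=$2 interval=${3:-1} i
    for ((i = 0; i < attempts; i++)); do
        if [[ $(get_runtime_state) == "$want" ]]; then
            return 0
        fi
        sleep "$interval"
    done
    return 1   # timed out: the wanted state was never observed
}
```

With a container that is already stopped, waiting for `created` fails after 5 attempts and equally after 30, because the state being polled for never occurs.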

Fix

Add a check for stopped state alongside the existing running check, so both states cause an early return. This properly handles the case where the container has already finished by the time the exec conmon instance starts.
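The check described above could look roughly like the following sketch. It assumes a hypothetical `get_runtime_state` helper backed by an illustrative `FAKE_STATE` variable; the real function queries the container runtime and is structured differently.

```bash
# Hypothetical stand-in for the runtime state query; not part of the test suite.
get_runtime_state() {
    echo "${FAKE_STATE:-created}"
}

# Returns 0 when the create/start flow should be skipped, 1 otherwise.
should_return_early() {
    local state
    state=$(get_runtime_state "$1")
    case $state in
        running|stopped) return 0 ;;   # container is already past "created"
        *) return 1 ;;
    esac
}
```

A caller would then write something like `should_return_early "$CTR_ID" && return 0` before waiting for the `created` state.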

All 119 tests pass with this fix, including the 7 exec tests, 3 attach tests, and 19 ctrl tests that exercise the affected code path.

Related: #642

Signed-off-by: Jindrich Novy <jnovy@redhat.com>

@packit-as-a-service

Ephemeral COPR build failed. @containers/packit-build please check.

@jnovy
Collaborator Author

jnovy commented Mar 16, 2026

@ricardobranco777 PTAL

@ricardobranco777
Contributor

@ricardobranco777 PTAL

Thanks for looking into this.

It passed on x86_64: https://openqa.opensuse.org/tests/5751790

On aarch64 we have this flake:
https://openqa-assets.opensuse.org/tests/5751789/file/conmon-conmon-runc-root.tap.txt

@ricardobranco777
Contributor

ricardobranco777 commented Mar 16, 2026

On aarch64 we have this flake: https://openqa-assets.opensuse.org/tests/5751789/file/conmon-conmon-runc-root.tap.txt

When backporting #645 and this patch we get this on aarch64:

http://openqa-assets.opensuse.org/tests/5751813/file/conmon-conmon-runc-user.tap.txt
http://openqa-assets.opensuse.org/tests/5751813/file/conmon-conmon-runc-root.tap.txt

#| FAIL: /usr/bin/runc start failed with 1: time="2026-03-16T10:24:46-04:00" level=error msg="cannot start a container that has stopped"

Also:
http://openqa-assets.opensuse.org/tests/5751894/file/conmon-conmon-runc-root.tap.txt
#| FAIL: timed out waiting for 'running' from conmon-test-1773675527-34682-29452

@ricardobranco777
Contributor

I gave up on #645

This PR doesn't seem to solve the issue on aarch64, as I'm getting these failures in the "ctr logs: journald partial message" test:

http://openqa-assets.opensuse.org/tests/5751931/file/conmon-conmon-runc-root.tap.txt

#| FAIL: timed out waiting for 'stopped' from conmon-test-1773678124-26964-30349

Fix two race conditions in start_conmon_with_default_args():

1. The previous fix checked for "running" or "stopped" state immediately
   after conmon returned, to handle --exec on already-running containers.
   However, on slow systems (e.g. aarch64), runc state can return a bogus
   "stopped" state (pid=0, created=epoch) during container creation,
   causing the function to return early without ever calling runc start.
   Fix: only skip the create/start flow when --exec is in the arguments,
   since that is the only case where the container is already managed.

2. On slow systems, there is a race condition where the container init
   process (created by runc create) may exit between the "created" state
   check and the runc start call. runc start then fails with "cannot
   start a container that has stopped". Fix: detect this specific error
   and treat it as success, since the container did run to completion.

Fixes the flaky test failures seen on aarch64 in openQA:
- https://openqa-assets.opensuse.org/tests/5751789/file/conmon-conmon-runc-root.tap.txt
- http://openqa-assets.opensuse.org/tests/5751813/file/conmon-conmon-runc-root.tap.txt
- http://openqa-assets.opensuse.org/tests/5751894/file/conmon-conmon-runc-root.tap.txt

Signed-off-by: Jindrich Novy <jnovy@redhat.com>
@jnovy jnovy force-pushed the fix-race-condition-start-conmon branch from 20e757c to 681a041 on March 16, 2026 16:37
@jnovy
Collaborator Author

jnovy commented Mar 16, 2026

@ricardobranco777 Thanks for the detailed test results! I investigated the failures thoroughly and found two distinct race conditions:

Race Condition 1 (caused by our original fix):
On aarch64, runc state can return a bogus "stopped" state (with pid: 0 and created: "0001-01-01T00:00:00Z") during container creation - before runc create has fully completed. Our original blanket check for stopped state matched this transient state and returned early, causing runc start to never be called. The container then stayed stuck in created state forever.

Fix: Only skip the create/start flow when --exec is in the arguments, since that is the only legitimate case where the container is already managed by a previous call.
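As a rough sketch of this check, the argument scan can be wrapped in a small predicate. The function name `has_exec_arg` is made up for this example; the patch itself sets a local `is_exec` flag inside start_conmon_with_default_args.

```bash
# Returns 0 if any argument is exactly "--exec", 1 otherwise.
# has_exec_arg is a hypothetical name used only for illustration.
has_exec_arg() {
    local arg
    for arg in "$@"; do
        if [[ $arg == "--exec" ]]; then
            return 0
        fi
    done
    return 1
}
```

Only when this predicate succeeds would the helper consult the running/stopped state and return early; for every other invocation, the normal create/start flow runs regardless of what a transient state query reports.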

Race Condition 2 (pre-existing, exposed by PR #645):
On slow systems, the container init process (created by runc create) can exit between the wait_for_runtime_status created check and the runc start call. runc start then fails with "cannot start a container that has stopped". This was always possible but PR #645 made it visible by adding die on start failure.

Fix: Detect the specific "cannot start a container that has stopped" error from runc start and treat it as success, since the container did run to completion. Also fall back to checking current state via runc state.
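A minimal sketch of this error handling follows, assuming the start command is passed in as arguments. The wrapper name `start_tolerating_stopped` is hypothetical, and the fallback to querying the current state is left out for brevity.

```bash
# Runs the given start command and tolerates one specific failure:
# the container having already stopped before it could be started.
# start_tolerating_stopped is an illustrative name, not from the patch.
start_tolerating_stopped() {
    local output status=0
    output=$("$@" 2>&1) || status=$?
    if ((status == 0)); then
        return 0
    fi
    # The init process exited between "created" and the start call.
    if [[ $output == *"cannot start a container that has stopped"* ]]; then
        return 0
    fi
    printf '%s\n' "$output" >&2
    return "$status"
}
```

Any other start failure is still reported with its original exit status, so only this one known race is downgraded to success.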

The updated commit addresses both issues.

@jnovy
Collaborator Author

jnovy commented Mar 16, 2026

@ricardobranco777 Seems we are chasing ghosts here. In fact this one is a runc regression fixed in runc 1.4.1 (released 2026-03-13) and was backported to 1.3.x branch of runc too - opencontainers/runc#5153

Can you retry with updated runc?

Contributor

@ricardobranco777 ricardobranco777 left a comment

@ricardobranco777 Seems we are chasing ghosts here. In fact this one is a runc regression fixed in runc 1.4.1 (released 2026-03-13) and was backported to 1.3.x branch of runc too - opencontainers/runc#5153

Can you retry with updated runc?

LGTM.

# stopped). In that case, we should not try to start it again.
local is_exec=false
for arg in "${extra_args[@]}"; do
if [[ "$arg" == "--exec" ]]; then
Contributor

Not to be pedantic, but no double-quotes needed for variables with [[. Also, == assumes a pattern. But this is harmless.

Suggested change
if [[ "$arg" == "--exec" ]]; then
if [[ $arg = "--exec" ]]; then
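For reference, a small sketch of how quoting affects comparisons inside `[[ ]]`. The helper names are hypothetical. As far as Bash's documented behavior goes, `=` and `==` are equivalent there; what decides between a literal and a pattern match is whether the right-hand side is quoted.

```bash
# Quoted right-hand side: compared as a literal string.
matches_literal() {
    [[ $1 = "$2" ]]
}

# Unquoted right-hand side: treated as a glob pattern.
matches_pattern() {
    [[ $1 = $2 ]]
}
```

The left-hand side needs no quotes in either case, because `[[ ]]` does not word-split or glob-expand its operands.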

# and the start call. runc start then fails with "cannot start a
# container that has stopped". Check if this is the case and treat
# it as success since the container did run to completion.
if expr "$start_output" : ".*cannot start a container that has stopped" > /dev/null; then
Contributor

@ricardobranco777 ricardobranco777 Mar 16, 2026

I believe we can replace all uses of expr with Bash's [[ just to avoid spawning new processes.

Suggested change
if expr "$start_output" : ".*cannot start a container that has stopped" > /dev/null; then
if [[ $start_output =~ "cannot start a container that has stopped" ]]; then
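One detail worth keeping in mind when moving from `expr` to `[[ ]]`: in Bash, a quoted right-hand side of `=~` is matched as a literal string, so regex metacharacters such as `.*` lose their meaning inside quotes. The sketch below shows two ways to test for the message without spawning a process; both function names are hypothetical.

```bash
# Glob form: the quoted part is literal, the surrounding * are wildcards.
contains_stopped_error() {
    [[ $1 == *"cannot start a container that has stopped"* ]]
}

# Regex form: keep the pattern in a variable and expand it unquoted.
# =~ is unanchored, so no leading ".*" is needed.
contains_stopped_error_re() {
    local re='cannot start a container that has stopped'
    [[ $1 =~ $re ]]
}
```

Either form can replace the `expr` call; the glob form avoids regex quoting rules altogether.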

@jnovy
Collaborator Author

jnovy commented Mar 16, 2026

If it runs for you with the new runc - even the old code should @ricardobranco777 - we don't need to merge this - it was a red herring.

@jnovy jnovy closed this Mar 16, 2026
@ricardobranco777
Contributor

ricardobranco777 commented Mar 16, 2026

If it runs for you with the new runc - even the old code should @ricardobranco777 - we don't need to merge this - it was a red herring.

The above runs were done with this patch and runc v1.4.1.

FWIW, the tests pass without this patch and runc v1.4.1:

Thanks for helping with #642

2 participants