Run HealthCheck without saving the ExecSession to the database #25003
base: main
Conversation
Force-pushed from ac4e5b3 to 31172f4
Force-pushed from 31172f4 to 79dddb6
Ephemeral COPR build failed. @containers/packit-build please check.
(1 similar comment)
Force-pushed from a2e6f31 to 8434c76
Force-pushed from 8434c76 to 30e4aac
/packit retest-failed
This PR should not merge until after
Force-pushed from 30e4aac to d7e9667
We're branched so this can be reviewed now
test/system/220-healthcheck.bats (outdated diff)
--health-cmd "sleep 20; echo $msg" \
$IMAGE /home/podman/pause

run_podman 1 healthcheck run $ctr &
This does not work; this will check nothing at all.
You cannot run these functions in the background, because the asserts they make do not fail anything. It is always a good idea to check that your tests fail when you write them; e.g. try to change the exit code here and nothing will happen.
It must be something like:
diff --git a/test/system/220-healthcheck.bats b/test/system/220-healthcheck.bats
index ba8b03dbac..6f63a92bf9 100644
--- a/test/system/220-healthcheck.bats
+++ b/test/system/220-healthcheck.bats
@@ -474,13 +474,20 @@ function _check_health_log {
--health-cmd "sleep 20; echo $msg" \
$IMAGE /home/podman/pause
- run_podman 1 healthcheck run $ctr &
+ timeout --foreground -v --kill=10 60 \
+ $PODMAN healthcheck run $ctr &
+ hc_pid=$!
run_podman inspect $ctr --format "{{.State.Status}}"
- assert "$output" == "running" "Container is stopped"
+ assert "$output" == "running" "Container is running"
run_podman stop $ctr
+ # Wait for background healthcheck to finish and make sure the exit status is 1
+ rc=0
+ wait -n $hc_pid || rc=$?
+ assert $rc -eq 1 "exit status check of healthcheck command"
+
run_podman inspect $ctr --format "{{.State.Status}}"
assert "$output" == "exited" "Container is stopped"
That said, I am not sure what this is supposed to test/assert. The test passes on both main and your PR, so it is not testing anything new from your PR. I get that it is impossible to test db writes here, but I am not sure how this test relates to your code changes.
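For readers less familiar with this bats idiom, here is the same background-and-wait pattern from the diff above as a small standalone Go sketch; the shell command and expected exit code are placeholders, not code from the PR:

package main

import (
    "errors"
    "fmt"
    "os/exec"
)

// Start the long-running check in the background, do foreground work,
// then wait and assert on the real exit status instead of asserting
// inside the background job (where failures would go unnoticed).
func main() {
    cmd := exec.Command("sh", "-c", "sleep 1; exit 1") // stands in for `podman healthcheck run`
    if err := cmd.Start(); err != nil {                // non-blocking, like `... &` in bats
        panic(err)
    }

    // ... foreground work would go here (inspect the container, stop it, ...) ...

    rc := 0
    var exitErr *exec.ExitError
    if err := cmd.Wait(); errors.As(err, &exitErr) { // like `wait -n $hc_pid` in the bats fix
        rc = exitErr.ExitCode()
    }
    fmt.Println("exit status:", rc) // a real test would assert rc == 1
}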
Thank you for improving the test.
Yes, the test does not test any new features. However, I haven't found a similar test that verifies that stopping the container while HealthCheck is running is possible, so I added a test for this case. (I got the suggestion to check this behavior from @mheon.)
I made a mistake during development, and the container could not be stopped while HealthCheck was running: the container stayed locked for the whole time the check ran, so no operations on the container were possible while HealthCheck was running.
The current tests should ensure that the new run of HealthCheck works the same as the previous version, which stored the ExecSession in the database.
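To illustrate the locking mistake described above, here is a toy Go sketch; a sync.Mutex stands in for the container lock, and none of the names are libpod's. Holding the lock for the whole health check would block stop until the check finishes, while releasing it after setup does not:

package main

import (
    "fmt"
    "os/exec"
    "sync"
    "time"
)

// container is a toy stand-in for libpod's Container.
type container struct{ mu sync.Mutex }

// runHealthCheck models the fixed behavior: lock only while preparing
// the in-memory exec session, then release the lock before the
// long-running check so stop/inspect can proceed in parallel.
func (c *container) runHealthCheck() error {
    c.mu.Lock()
    cmd := exec.Command("sleep", "2") // stands in for the container's health-cmd
    c.mu.Unlock()                     // the buggy version effectively kept the lock held here
    return cmd.Run()
}

// stop needs the lock; with the bug it would block for the whole check.
func (c *container) stop() {
    c.mu.Lock()
    defer c.mu.Unlock()
    fmt.Println("stopped")
}

func main() {
    c := &container{}
    go c.runHealthCheck()
    time.Sleep(100 * time.Millisecond) // let the health check start
    c.stop()                           // returns promptly thanks to the narrow lock scope
}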
I think it is fine to check this if there is no other test. That said, I am not sure the current behavior makes a lot of sense: the healthcheck process will be killed when the main ctr pid exits.
Looking at the container hc logs, it still seems to be recorded as unhealthy, which might be confusing.
I have not checked how docker behaves in such a case. @mheon WDYT: should we ignore the healthcheck result if the container was stopped and the hc was killed because of it?
I have checked the implementations, and I think there is a mismatch between the inspect healthcheck logs and the output of the healthcheck run command. I think there should be a new state that indicates that the container has been stopped. WDYT @mheon @Luap99
HealthCheck results when HealthCheck is running and the container is stopped:
Docker implementation:
- HealthCheck status: unhealthy
- FailingStreak: 1
- Command line output: n/a
- Command line exit code: n/a
(Docker doesn't have a docker healthcheck run command or anything like that.)
Podman implementation:
- HealthCheck status: healthy
- FailingStreak: 1
- Command line output: unhealthy
- Command line exit code: 1
New Podman implementation:
- HealthCheck status: healthy
- FailingStreak: 1
- Command line output: unhealthy
- Command line exit code: 1
Docker doesn't have a docker healthcheck run or anything like that.
It is not about the command but about what happens when the healthcheck is running while the container is stopped. I only really care about the docker/podman inspect healthcheck log. The output of podman healthcheck run is up to us to decide; IMO we should exit 0 to avoid leaking the transient systemd unit.
That said, the details here are out of scope for this PR anyway; I will do another review here later.
Force-pushed from d7e9667 to a61e378
/packit rebuild-failed
/packit retest-failed
/packit build
LGTM. One thing that confused me for a while is the extra conmon involved, as this usually means it spawns the ExitCommand, which would cause unnecessary work and errors. However, the ExitCommand is not set here, so all works well.
I guess in theory there is no need to proxy the exec through conmon at all, and we could just call the oci runtime exec command directly, as we have no need for any of the tty or ExitCommand handling. But that is certainly future work and out of scope for the db write issue you are fixing.
Two small nits in case you need to repush, but I am fine with them as well.
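As a rough illustration of that future-work idea (purely a sketch of the suggestion, not code from this PR: the runtime path, container ID, and health command are all assumptions), driving the OCI runtime's exec subcommand directly could look like:

package main

import (
    "bytes"
    "fmt"
    "os/exec"
)

func main() {
    // Hypothetical direct exec through the OCI runtime, skipping conmon:
    // a health check needs no tty and no ExitCommand, so the runtime
    // binary could in principle be driven straight from podman.
    var stdout, stderr bytes.Buffer
    cmd := exec.Command("/usr/bin/runc", "exec", "mycontainer", "/bin/true")
    cmd.Stdout = &stdout
    cmd.Stderr = &stderr

    err := cmd.Run()
    rc := 0
    if exitErr, ok := err.(*exec.ExitError); ok {
        rc = exitErr.ExitCode() // the health status would be derived from this
    } else if err != nil {
        panic(err) // e.g. the runtime binary is missing
    }
    fmt.Printf("healthcheck exit code %d, output %q\n", rc, stdout.String())
}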
libpod/container_exec.go (outdated diff)
err = <-attachErrChan
if err != nil {
    return -1, fmt.Errorf("Container %s light exec session with pid: %d error: %v", c.ID(), pid, err)
nit: "Container" should be lower case.
All Go errors should start lowercase, as is common style. I know it is not enforced and you may find other examples, but for the most part they really need to be lower case.
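Applied to the quoted line, the fix is just the initial letter (a one-line sketch of the suggested change):

return -1, fmt.Errorf("container %s light exec session with pid: %d error: %v", c.ID(), pid, err)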
libpod/container_exec.go
Outdated
@@ -775,6 +793,68 @@ func (c *Container) ExecResize(sessionID string, newSize resize.TerminalSize) er | |||
return c.ociRuntime.ExecAttachResize(c, sessionID, newSize) | |||
} | |||
|
|||
func (c *Container) healthCheckExec(config *ExecConfig, streams *define.AttachStreams) (exitCode int, retErr error) { |
nit: neither exitCode nor retErr is used, so we do not need named return parameters; they can be dropped.
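With the named returns dropped, the signature would read (one-line sketch of the suggested change):

func (c *Container) healthCheckExec(config *ExecConfig, streams *define.AttachStreams) (int, error) {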
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Honny1, Luap99
Force-pushed from a61e378 to fd3b0aa
Run HealthCheck without saving the ExecSession to the database
Fixes: https://issues.redhat.com/browse/RHEL-69970
Signed-off-by: Jan Rodák <[email protected]>
Force-pushed from fd3b0aa to c684da0
@Luap99 I fixed the nits you mentioned. It looks like FreeBSD 13.3 is EOL; I will create a PR to bump the image to 13.4.
This PR creates a method to run the HealthCheck command without creating and deleting an ExecSession in the database.
When HealthCheck is run using the original exec method, an ExecSession is created and then deleted. This approach causes unexpectedly high IO usage from synchronizing the container and from creating and deleting the ExecSession in the database.
The new healthCheckExec function locks the container and creates the ExecSession locally, without writing it to the database, and then executes that local ExecSession. As a result, the number of writes to the database has been reduced to zero.
Verify reduction
- Run a container with /bin/true as a health check that runs every 10 seconds.
- Trace writes for two minutes with the SystemTap script check.stp: timeout 120 stap check.stp 0x23 > stap.out
- Show the three most frequent entries, with digits stripped so they aggregate: sed -E 's/([0-9]+)//' stap.out | sort | uniq -c | sort -bn | tail -n 3
Fixes: https://issues.redhat.com/browse/RHEL-69970
Does this PR introduce a user-facing change?