Run HealthCheck without saving the `ExecSession` to the database #25003

Honny1 · 2025-01-13T16:22:36Z

This PR creates a method to run the HealthCheck command without creating and deleting an ExecSession in the database.

When HealthCheck runs through the original exec method, an ExecSession is created in the database and then deleted again. Together with synchronizing the container, these database writes cause unexpectedly high IO usage.

The new healthCheckExec function locks the container, creates the ExecSession locally without writing it to the database, and then executes that local ExecSession. As a result, the number of database writes has been reduced to zero.
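As a rough illustration of the idea (not the actual libpod code), the sketch below contrasts an exec path that persists a session record with one that keeps the session only in memory. Every name in it (`store`, `session`, `container`, `execWithDB`, `execLocal`) is hypothetical, and the write counter merely stands in for database IO. It also assumes the container lock is released before the command runs.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for the container database.
// It only counts writes so the two exec paths can be compared.
type store struct {
	mu     sync.Mutex
	writes int
}

func (s *store) write() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.writes++
}

// session is a hypothetical, heavily simplified exec session.
type session struct {
	id  string
	cmd []string
}

// container is a hypothetical, heavily simplified container.
type container struct {
	mu sync.Mutex
	db *store
}

// execWithDB models the original path: the session is saved before the
// command runs and removed afterwards, which costs two database writes.
func (c *container) execWithDB(cmd []string, run func(*session) int) int {
	c.mu.Lock()
	s := &session{id: "persisted", cmd: cmd}
	c.db.write() // save the session
	c.mu.Unlock()

	code := run(s)

	c.mu.Lock()
	c.db.write() // remove the session
	c.mu.Unlock()
	return code
}

// execLocal models the new path: the session lives only in memory, and the
// container lock is dropped before the command runs.
func (c *container) execLocal(cmd []string, run func(*session) int) int {
	c.mu.Lock()
	s := &session{id: "local", cmd: cmd}
	c.mu.Unlock()
	return run(s)
}

func main() {
	db := &store{}
	c := &container{db: db}
	ok := func(*session) int { return 0 }

	c.execWithDB([]string{"/bin/true"}, ok)
	fmt.Println("writes after persisted exec:", db.writes)

	c.execLocal([]string{"/bin/true"}, ok)
	fmt.Println("writes after local exec:", db.writes)
}
```

In this toy model the persisted path adds two writes per health check, while the local path leaves the counter unchanged.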

Verify reduction

Start 30 containers with /bin/true as a health check that runs every 10 seconds.
Monitor writes for two mins: timeout 120 stap check.stp 0x23 > stap.out
- Note: you will probably need to install debug symbols for the kernel: dnf debuginfo-install kernel-$(uname -r)
- Script check.stp:

#!/usr/bin/stap

global mydev

probe begin
{
  dev = usrdev2kerndev($1)
  mydev = MKDEV(MAJOR(dev), MINOR(dev))
}

probe vfs.write.return
{
  if (dev == mydev)
    printf ("%s(%d)[%s(%d)] %s 0x%x %s\n", execname(), pid(), pexecname(), ppid(), ppfunc(), dev, fullpath_struct_file(task_current(), @entry($file)))
}

Process result: sed -E 's/([0-9]+)//' stap.out | sort | uniq -c | sort -bn | tail -n 3
- The result before should look like this:

4179 podman()[systemd(1)] vfs_write 0x23 /var/lib/containers/storage/db.sql
16115 podman()[systemd(1)] vfs_write 0x23 /var/lib/containers/storage/db.sql-journal
29857374 stapio()[stap(195325)] vfs_write 0x23 /home/jrodak/dev/stap.out

After applying this change, the result should either no longer contain /var/lib/containers/storage/db.sql and /var/lib/containers/storage/db.sql-journal, or show a smaller count at the start of those lines.
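For readers who want to post-process stap.out without the shell pipeline, here is a minimal Go sketch of the same normalization and counting step. The helper names `stripFirstNumber` and `countLines` and the sample input lines are made up for illustration; only the line format follows what check.stp prints.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

var firstNumber = regexp.MustCompile(`[0-9]+`)

// stripFirstNumber removes the first run of digits from a line. This mirrors
// sed -E 's/([0-9]+)//', which (without the g flag) replaces only the first
// match, i.e. the PID of the writing process.
func stripFirstNumber(line string) string {
	loc := firstNumber.FindStringIndex(line)
	if loc == nil {
		return line
	}
	return line[:loc[0]] + line[loc[1]:]
}

// countLines tallies identical normalized lines, like sort | uniq -c.
func countLines(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, l := range lines {
		counts[stripFirstNumber(l)]++
	}
	return counts
}

func main() {
	// Made-up sample lines in the format printed by check.stp.
	lines := []string{
		"podman(101)[systemd(1)] vfs_write 0x23 /var/lib/containers/storage/db.sql",
		"podman(202)[systemd(1)] vfs_write 0x23 /var/lib/containers/storage/db.sql",
		"podman(303)[systemd(1)] vfs_write 0x23 /var/lib/containers/storage/db.sql-journal",
	}

	counts := countLines(lines)
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	// Sort ascending by count, like sort -bn on the uniq -c output.
	sort.Slice(keys, func(i, j int) bool { return counts[keys[i]] < counts[keys[j]] })

	for _, k := range keys {
		fmt.Printf("%7d %s\n", counts[k], k)
	}
}
```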

Fixes: https://issues.redhat.com/browse/RHEL-69970

Does this PR introduce a user-facing change?

The HealthCheck is executed without writing to the database.

packit-as-a-service · 2025-01-13T16:34:54Z

Ephemeral COPR build failed. @containers/packit-build please check.

packit-as-a-service · 2025-01-13T16:34:55Z

Ephemeral COPR build failed. @containers/packit-build please check.

Honny1 · 2025-01-15T09:47:42Z

/packit retest-failed

Honny1 · 2025-01-15T13:35:55Z

This PR should not merge until after 5.4 branches.

We'd like this to sit in Fedora for a while before we put it in RHEL.

mheon · 2025-02-03T20:46:06Z

We're branched so this can be reviewed now

Honny1 · 2025-02-05T09:20:42Z

PTAL: @mheon @Luap99

libpod/container_exec.go

Luap99 · 2025-02-05T11:29:28Z

test/system/220-healthcheck.bats

+           --health-cmd "sleep 20; echo $msg" \
+           $IMAGE /home/podman/pause
+
+    run_podman 1 healthcheck run $ctr &


This does not work; it checks nothing at all.
You cannot run these functions in the background, because the asserts they make then do not fail anything. It is always a good idea to check that your tests actually fail when you write them, e.g. try changing the expected exit code and nothing will happen.

It must be something like

diff --git a/test/system/220-healthcheck.bats b/test/system/220-healthcheck.bats
index ba8b03dbac..6f63a92bf9 100644
--- a/test/system/220-healthcheck.bats
+++ b/test/system/220-healthcheck.bats
@@ -474,13 +474,20 @@ function _check_health_log {
            --health-cmd "sleep 20; echo $msg" \
            $IMAGE /home/podman/pause
 
-    run_podman 1 healthcheck run $ctr &
+    timeout --foreground -v --kill=10 60 \
+        $PODMAN healthcheck run $ctr &
+    hc_pid=$!
 
     run_podman inspect $ctr --format "{{.State.Status}}"
-    assert "$output" == "running" "Container is stopped"
+    assert "$output" == "running" "Container is running"
 
     run_podman stop $ctr
 
+    # Wait for background healthcheck to finish and make sure the exit status is 1
+    rc=0
+    wait -n $hc_pid || rc=$?
+    assert $rc -eq 1 "exit status check of healthcheck command"
+
     run_podman inspect $ctr --format "{{.State.Status}}"
     assert "$output" == "exited" "Container is stopped"

That said I am not sure what this is supposed to test/assert? The test passes on main and your PR so it is not testing anything new from your PR. I get that it is impossible to test db writes here but I am not sure how this test relates to your code changes.

Thank you for improving the test.

Yes, the test does not test any new features. However, I haven't found a similar test that verifies whether stopping the container when HealthCheck is running is possible. That's why I added a test for this case. (I got the suggestion to check this behavior from @mheon.)

I made a mistake during development: the container could not be stopped while HealthCheck was running, because the container stayed locked for the whole duration of the HealthCheck. As a result, no other operation on the container was possible while a HealthCheck was in progress.

The current tests should ensure that the new run of HealthCheck works the same as the previous version with execSession stored in the database.

I think it is fine to check this if there is no other test. That said, I am not sure the current behavior makes a lot of sense: the healthcheck process will be killed when the main ctr pid exits.
Looking at the container hc logs, it still seems to be recorded as unhealthy, which might be confusing.
I have not checked how docker behaves in such a case. @mheon WDYT, should we ignore the healthcheck result if the container was stopped and the hc was killed because of it?

I have checked the implementations and I think there is a mismatch between the inspect healthcheck logs and the output of the healthcheck run command. I think there should be a new state that indicates that the container has been stopped. WDYT @mheon @Luap99

HealthCheck results when HealthCheck is running and the container is stopped:

- Docker implementation:
  - HealthCheck status: unhealthy
  - FailingStreak: 1
  - Command line output: n/a
  - Command line exit code: n/a
  - Docker doesn't have a docker healthcheck run or anything like that.
- Podman implementation:
  - HealthCheck status: healthy
  - FailingStreak: 1
  - Command line output: unhealthy
  - Command line exit code: 1
- New Podman implementation:
  - HealthCheck status: healthy
  - FailingStreak: 1
  - Command line output: unhealthy
  - Command line exit code: 1

> Docker doesn't have a docker healthcheck run or anything like that.

It is not about the command but what happens if the healthcheck is running while the container is stopped. I only really care about the docker/podman inspect healthcheck log. The output of podman healthcheck run is up to us to decide, IMO we should exit 0 to avoid leaking the transient systemd unit.

That said the details here are out of scope for this PR anyway, I will do another review here later.

Honny1 · 2025-02-05T15:44:29Z

/packit rebuild-failed

Honny1 · 2025-02-05T16:31:10Z

/packit retest-failed

Honny1 · 2025-02-06T17:26:55Z

/packit build

Luap99

LGTM. One thing that confused me for a while is the extra conmon involved, as this usually means it spawns the ExitCommand, which would cause unnecessary work and errors. However, the ExitCommand is not set, so all works well.

I guess in theory there is no need to proxy the exec through conmon at all and we could just call the oci runtime exit command directly as we have no need for any of the tty or ExitCommand. But that is certainly future work and out of scope for the db write issue you are fixing.

Two small nits in case you need to repush but I am fine with them as well

Luap99 · 2025-02-10T16:48:54Z

libpod/container_exec.go

+
+	err = <-attachErrChan
+	if err != nil {
+		return -1, fmt.Errorf("Container %s light exec session with pid: %d error: %v", c.ID(), pid, err)


nit, container should be lower case.
All the golang errors should start lowercase as common style. I know it is not enforced and you may find other examples but for the most part they really need to be lower case.
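A small sketch of why the lowercase convention matters: error strings are usually embedded into longer messages by callers, so a capitalized first word ends up in the middle of a sentence. The function name `wrapAttachError`, the sentinel `errAttach`, and the sample values are hypothetical. The use of `%w` instead of the `%v` from the diff is a choice made for this example so the wrapped error stays inspectable with `errors.Is`.

```go
package main

import (
	"errors"
	"fmt"
)

// errAttach is a hypothetical sentinel error used only for this example.
var errAttach = errors.New("attach failed")

// wrapAttachError builds a message that starts in lower case, so it still
// reads naturally once a caller prefixes it with more context.
func wrapAttachError(id string, pid int, err error) error {
	return fmt.Errorf("container %s light exec session with pid: %d error: %w", id, pid, err)
}

func main() {
	err := wrapAttachError("abc123", 42, errAttach)

	// A caller adds its own context in front of the message.
	fmt.Println(fmt.Errorf("running healthcheck: %w", err))

	// Because %w was used, the original error can still be matched.
	fmt.Println(errors.Is(err, errAttach))
}
```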

Luap99 · 2025-02-10T16:51:08Z

libpod/container_exec.go

@@ -775,6 +793,68 @@ func (c *Container) ExecResize(sessionID string, newSize resize.TerminalSize) er
 	return c.ociRuntime.ExecAttachResize(c, sessionID, newSize)
 }

+func (c *Container) healthCheckExec(config *ExecConfig, streams *define.AttachStreams) (exitCode int, retErr error) {


nit, neither exitCode nor retErr is used, so we do not need named return parameters and they can be dropped.
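To illustrate the distinction behind this nit: named return parameters earn their keep mainly when a deferred function has to read or change the result. The sketch below is generic Go, not taken from libpod, and all names in it (`withNamedReturn`, `withoutNamedReturn`, `failingCleanup`, `errCleanup`) are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errCleanup and failingCleanup are hypothetical helpers for this example.
var errCleanup = errors.New("cleanup failed")

func failingCleanup() error { return errCleanup }

// withNamedReturn needs the named result: the deferred function overwrites
// retErr when cleanup fails and no earlier error was set.
func withNamedReturn(cleanup func() error) (retErr error) {
	defer func() {
		if err := cleanup(); err != nil && retErr == nil {
			retErr = err
		}
	}()
	return nil
}

// withoutNamedReturn never touches its results by name, so plain unnamed
// results are enough and keep the signature shorter.
func withoutNamedReturn() (int, error) {
	return 0, nil
}

func main() {
	fmt.Println(withNamedReturn(failingCleanup))

	code, err := withoutNamedReturn()
	fmt.Println(code, err)
}
```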

openshift-ci · 2025-02-10T17:17:56Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Honny1, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Luap99]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Honny1 · 2025-02-11T10:13:42Z

@Luap99 I fixed the nits you mentioned. It looks like FreeBSD 13.3 is EOL. I will create a PR to bump the image to 13.4.

Honny1 · 2025-02-11T10:26:05Z

[CI] Bump FreeBSD version to 13.4 #25288

openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note labels Jan 13, 2025

Honny1 force-pushed the no-db-healtcheck-exec branch from ac4e5b3 to 31172f4 Compare January 13, 2025 16:27

Honny1 changed the title ~~Create exec method without database writes for HealthCheck execution~~ Run HealthCheck without saving the ExecSession to the database Jan 13, 2025

Honny1 force-pushed the no-db-healtcheck-exec branch from 31172f4 to 79dddb6 Compare January 13, 2025 16:34

Honny1 added the No New Tests Allow PR to proceed without adding regression tests label Jan 13, 2025

Honny1 force-pushed the no-db-healtcheck-exec branch 2 times, most recently from a2e6f31 to 8434c76 Compare January 14, 2025 11:45

Honny1 removed the No New Tests Allow PR to proceed without adding regression tests label Jan 14, 2025

Honny1 force-pushed the no-db-healtcheck-exec branch from 8434c76 to 30e4aac Compare January 14, 2025 17:20

Honny1 force-pushed the no-db-healtcheck-exec branch from 30e4aac to d7e9667 Compare February 3, 2025 13:57

Honny1 marked this pull request as ready for review February 3, 2025 14:48

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2025

Luap99 requested changes Feb 5, 2025


Honny1 force-pushed the no-db-healtcheck-exec branch from d7e9667 to a61e378 Compare February 5, 2025 14:48

Honny1 requested a review from Luap99 February 6, 2025 18:45

Luap99 approved these changes Feb 10, 2025


openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2025

Honny1 force-pushed the no-db-healtcheck-exec branch from a61e378 to fd3b0aa Compare February 11, 2025 09:52

Run HealthCheck without creating and removing the ExecSession in the …

c684da0

…database Fixes: https://issues.redhat.com/browse/RHEL-69970 Signed-off-by: Jan Rodák <[email protected]>

Honny1 force-pushed the no-db-healtcheck-exec branch from fd3b0aa to c684da0 Compare February 11, 2025 10:01
