Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run HealthCheck without saving the ExecSession to the database #25003

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Honny1
Copy link
Member

@Honny1 Honny1 commented Jan 13, 2025

This PR creates a method to run the HealthCheck command without creating and deleting an ExecSession in the database.

When HealthCheck is run using the original exec method, an ExecSession is created and deleted. This approach causes unexpectedly higher IO usage when synchronizing the container and creating and deleting ExecSession in the database.

The new healthCheckExec function locks the container and creates the ExecSession locally without writing to the database. Executes a local ExecSession. As a result, the number of writes in the database has been reduced to zero.

Verify reduction

  • Start 30 containers with /bin/true as a health check that runs every 10 seconds.
  • Monitor writes for two mins: timeout 120 stap check.stp 0x23 > stap.out
    • Note: you will probably need to install debug symbols for the kernel: dnf debuginfo-install kernel-$(uname -r)

    • Script check.stp:
#!/usr/bin/stap

global mydev

probe begin

{ dev = usrdev2kerndev($1) mydev = MKDEV(MAJOR(dev), MINOR(dev)) }
probe vfs.write.return

{ if (dev == mydev) printf ("%s(%d)[%s(%d)] %s 0x%x %s\n", execname(), pid(), pexecname(), ppid(), ppfunc(), dev, fullpath_struct_file(task_current(), @entry($file))) }
  • Process result: sed -E 's/([0-9]+)//' stap.out | sort | uniq -c | sort -bn | tail -n 3
    • The result before should look like this:
4179 podman()[systemd(1)] vfs_write 0x23 /var/lib/containers/storage/db.sql
16115 podman()[systemd(1)] vfs_write 0x23 /var/lib/containers/storage/db.sql-journal
29857374 stapio()[stap(195325)] vfs_write 0x23 /home/jrodak/dev/stap.out

The result after applying this change should not contain /var/lib/containers/storage/db.sql and /var/lib/containers/storage/db.sql-journal or have a smaller first number on the line.

Fixes: https://issues.redhat.com/browse/RHEL-69970

Does this PR introduce a user-facing change?

The HelathCheck is executed without writing to the database. 

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note labels Jan 13, 2025
@Honny1 Honny1 force-pushed the no-db-healtcheck-exec branch from ac4e5b3 to 31172f4 Compare January 13, 2025 16:27
@Honny1 Honny1 changed the title Create exec method without database writes for HealthCheck execution Run HealthCheck without saving the ExecSession to the database Jan 13, 2025
@Honny1 Honny1 force-pushed the no-db-healtcheck-exec branch from 31172f4 to 79dddb6 Compare January 13, 2025 16:34
Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

1 similar comment
Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

@Honny1 Honny1 added the No New Tests Allow PR to proceed without adding regression tests label Jan 13, 2025
@Honny1 Honny1 force-pushed the no-db-healtcheck-exec branch 2 times, most recently from a2e6f31 to 8434c76 Compare January 14, 2025 11:45
@Honny1 Honny1 removed the No New Tests Allow PR to proceed without adding regression tests label Jan 14, 2025
@Honny1 Honny1 force-pushed the no-db-healtcheck-exec branch from 8434c76 to 30e4aac Compare January 14, 2025 17:20
@Honny1
Copy link
Member Author

Honny1 commented Jan 15, 2025

/packit retest-failed

@Honny1
Copy link
Member Author

Honny1 commented Jan 15, 2025

This PR should not merge until after 5.4 branches.

We'd like this to sit in Fedora for a while before we put it in RHEL.

@Honny1 Honny1 force-pushed the no-db-healtcheck-exec branch from 30e4aac to d7e9667 Compare February 3, 2025 13:57
@Honny1 Honny1 marked this pull request as ready for review February 3, 2025 14:48
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2025
@mheon
Copy link
Member

mheon commented Feb 3, 2025

We're branched so this can be reviewed now

@Honny1
Copy link
Member Author

Honny1 commented Feb 5, 2025

PTAL: @mheon @Luap99

libpod/container_exec.go Show resolved Hide resolved
libpod/container_exec.go Outdated Show resolved Hide resolved
--health-cmd "sleep 20; echo $msg" \
$IMAGE /home/podman/pause

run_podman 1 healthcheck run $ctr &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not work. This will check nothing at all.
You cannot run these function in the background as the asserts they make do not fail anything, it is always a good idea to check your test are failing when you write them, e.g. try to change the exit code and nothing will happen.

It must be something like

diff --git a/test/system/220-healthcheck.bats b/test/system/220-healthcheck.bats
index ba8b03dbac..6f63a92bf9 100644
--- a/test/system/220-healthcheck.bats
+++ b/test/system/220-healthcheck.bats
@@ -474,13 +474,20 @@ function _check_health_log {
            --health-cmd "sleep 20; echo $msg" \
            $IMAGE /home/podman/pause
 
-    run_podman 1 healthcheck run $ctr &
+    timeout --foreground -v --kill=10 60 \
+        $PODMAN healthcheck run $ctr &
+    hc_pid=$!
 
     run_podman inspect $ctr --format "{{.State.Status}}"
-    assert "$output" == "running" "Container is stopped"
+    assert "$output" == "running" "Container is running"
 
     run_podman stop $ctr
 
+    # Wait for background healthcheck to finish and make sure the exit status is 1
+    rc=0
+    wait -n $hc_pid || rc=$?
+    assert $rc -eq 1 "exit status check of healthcheck command"
+
     run_podman inspect $ctr --format "{{.State.Status}}"
     assert "$output" == "exited" "Container is stopped"

That said I am not sure what this is supposed to test/assert? The test passes on main and your PR so it is not testing anything new from your PR. I get that it is impossible to test db writes here but I am not sure how this test relates to your code changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for improving the test.

Yes, the test does not test any new features. However, I haven't found a similar test that verifies whether stopping the container when HealthCheck is running is possible. That's why I added a test for this case. (I got the suggestion to check this behavior from @mheon.)

I made a mistake during development and the container could not be stopped when HealthCheck was running. The container was locked during the whole time it was running. So when HealthCheck would run it would not be possible to do any operations with the container.

The current tests should ensure that the new run of HealthCheck works the same as the previous version with execSession stored in the database.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to check if there is no other test. That said I am not sure the current behavior makes a lot of sense, the healthcheck process will be killed when the main ctr pid exits.
Looking at the container hc logs it still seems to be recorded as unhealthy which might be confusing.
I have not checked how docker behaves in such case, @mheon WDYT should we ignore the healtcheck result if the container was stopped and the hc was killed because of it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have checked the implementations and I think there is a mismatch between the inspect healthcheck logs and the output of the healthcheck run command. I think there should be a new state that indicates that the container has been stopped. WDYT @mheon @Luap99

HealthCheck results when HealthCheck is running and the container is stopped:
Docker implementation:

  • HealthCheck status: unhealthy
  • FailingStreak: 1
  • Command line output: n/a
  • Command line exit code: n/a

Docker doesn't have a docker healthcheck run or anything like that.

Podman implementation:

  • HealthCheck status: healthy
  • FailingStreak: 1
  • Command line output: unhealthy
  • Command line exit code: 1

New Podman implementation:

  • HealthCheck status: healthy
  • FailingStreak: 1
  • Command line output: unhealthy
  • Command line exit code: 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker doesn't have a docker healthcheck run or anything like that.

It is not about the command but what happens if the healthcheck is running while the container is stopped. I only really care about the docker/podman inspect healthcheck log. The output of podman healthcheck run is up to us to decide, IMO we should exit 0 to avoid leaking the transient systemd unit.

That said the details here are out of scope for this PR anyway, I will do another review here later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Honny1 Honny1 force-pushed the no-db-healtcheck-exec branch from d7e9667 to a61e378 Compare February 5, 2025 14:48
@Honny1
Copy link
Member Author

Honny1 commented Feb 5, 2025

/packit rebuild-failed

@Honny1
Copy link
Member Author

Honny1 commented Feb 5, 2025

/packit retest-failed

@Honny1
Copy link
Member Author

Honny1 commented Feb 6, 2025

/packit build

@Honny1 Honny1 requested a review from Luap99 February 6, 2025 18:45
Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one thing that confused me for a while is the extra conmon involved as this useally means it spawns the ExitCommand which would cause unnecessary work and errors. However the ExitCommand is not set so all works well.

I guess in theory there is no need to proxy the exec through conmon at all and we could just call the oci runtime exit command directly as we have no need for any of the tty or ExitCommand. But that is certainly future work and out of scope for the db write issue you are fixing.

Two small nits in case you need to repush but I am fine with them as well


err = <-attachErrChan
if err != nil {
return -1, fmt.Errorf("Container %s light exec session with pid: %d error: %v", c.ID(), pid, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, container should be lower case.
All the golang errors should start lowercase as common style. I know it is not enforced and you may find other examples but for the most part they really need to be lower case.

@@ -775,6 +793,68 @@ func (c *Container) ExecResize(sessionID string, newSize resize.TerminalSize) er
return c.ociRuntime.ExecAttachResize(c, sessionID, newSize)
}

func (c *Container) healthCheckExec(config *ExecConfig, streams *define.AttachStreams) (exitCode int, retErr error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, neither exitCode or retErr are used so we do not need named return parameter so they can be dropped.

Copy link
Contributor

openshift-ci bot commented Feb 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Honny1, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2025
@Honny1 Honny1 force-pushed the no-db-healtcheck-exec branch from a61e378 to fd3b0aa Compare February 11, 2025 09:52
@Honny1 Honny1 force-pushed the no-db-healtcheck-exec branch from fd3b0aa to c684da0 Compare February 11, 2025 10:01
@Honny1
Copy link
Member Author

Honny1 commented Feb 11, 2025

@Luap99 I fixed the nits you mentioned. It looks like FreeBSD 13.3 is EOL. I will create a PR to bump the image to 13.4.

@Honny1
Copy link
Member Author

Honny1 commented Feb 11, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants