Replies: 4 comments
- Same question
- Same question too
- Same question
- Same question
We are using actions-runner-controller with self-hosted runners in our GHES environment. We have runners with different amounts of resources allocated to them, generically labelled small, medium, and large. I've been running load tests against the runners to make sure we'll be able to have multiple workflows running at the same time, but I've noticed that some runners randomly receive a SIGTERM and begin to terminate. Specifically, this always happens when I run a large number of jobs (at least 20) at once.

From what I can tell, the controller sends a termination signal to a runner that is still being used by a job. The controller logs that the runner couldn't be removed because a job is still running, but after waiting a while it starts terminating the pod anyway. The runner has a grace period of 120 seconds before fully terminating, during which some jobs are able to complete, but the majority are still running. At that point the job gets cancelled and I see the following message:
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
In the controller logs, I see:
```
2023-07-25T19:53:55Z DEBUG runnerpod Failed to unregister runner before deleting the pod. {"runnerpod": "actions-runner-system/gha-runnerset-large-qrcp2-0", "error": "failed to remove runner: DELETE https://<github-url>/api/v3/enterprises/.../actions/runners/5722: 422 Bad request - Runner \"gha-runnerset-large-qrcp2-0\" is still running a job\" []"}
```
Then it shows the following message a couple of times:
```
2023-07-25T19:53:55Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/gha-runnerset-large-qrcp2-0"}
```
Finally, this:
```
2023-07-25T19:55:45Z INFO runnerpod Runner pod has been stopped with a successful status. {"runnerpod": "actions-runner-system/gha-runnerset-large-qrcp2-0"}
2023-07-25T19:56:41Z INFO runnerpod Failed to delete pod within 1m0s. This is typically the case when a Kubernetes node became unreachable and the kube controller started evicting nodes. Forcefully deleting the pod to not get stuck. {"runnerpod": "actions-runner-system/gha-runnerset-large-qrcp2-0", "podDeletionTimestamp": "2023-07-25 19:53:52 +0000 UTC", "currentTime": "2023-07-25T19:56:41Z", "configuredDeletionTimeout": "1m0s"}
```
Exactly a minute later, the pod gets deleted.
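For context, the large RunnerSet is shaped roughly like the sketch below. This is a simplified illustration rather than the exact config: the enterprise name, replica count, resource sizes, and the RUNNER_GRACEFUL_STOP_TIMEOUT value are placeholders, and the real file is attached further down as runnerset.txt. The 120-second grace period mentioned above corresponds to terminationGracePeriodSeconds on the pod template.

```yaml
# Simplified sketch of the "large" RunnerSet; the real config has more fields
# (image, storage, node selectors, etc.) and may differ in details.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: gha-runnerset-large
  namespace: actions-runner-system
spec:
  enterprise: my-enterprise              # placeholder; runners register at the enterprise level on GHES
  labels:
    - large                              # generic size label targeted by workflows via runs-on
  replicas: 5                            # placeholder count
  # StatefulSet-style fields required by the RunnerSet CRD
  selector:
    matchLabels:
      app: gha-runnerset-large
  serviceName: gha-runnerset-large
  template:
    metadata:
      labels:
        app: gha-runnerset-large
    spec:
      terminationGracePeriodSeconds: 120 # the 120-second grace period described above
      containers:
        - name: runner
          env:
            - name: RUNNER_GRACEFUL_STOP_TIMEOUT
              value: "110"               # placeholder; time the runner gets to wrap up before being killed
          resources:
            requests:
              cpu: "4"                   # placeholder sizing for a "large" runner
              memory: 8Gi
```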
I've attached the workflow I use for testing, as well as the runnerset config:
runnerset.txt
workflow.txt
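The test workflow is essentially a fan-out of 20+ parallel jobs that run long enough to overlap; it looks roughly like the sketch below. The job count, labels, and sleep duration here are illustrative only, and the real file is the workflow.txt attachment above.

```yaml
# Simplified sketch of the load-test workflow; see workflow.txt for the real one.
name: runner-load-test
on: workflow_dispatch

jobs:
  load:
    strategy:
      fail-fast: false
      matrix:
        index: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                11, 12, 13, 14, 15, 16, 17, 18, 19, 20]  # at least 20 concurrent jobs
    runs-on: [self-hosted, large]   # targets the "large" runners by label
    steps:
      - name: Simulate a long-running job
        run: sleep 600              # keeps the job busy so pods are still mid-job when they get terminated
```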
Does anyone know how to find out why the runners are getting terminated in the middle of the job in the first place?