Replies: 4 comments
- Same question
- Same question too
- Same question
- Same question
We are using actions-runner-controller with self-hosted runners in our GHES environment. We have runners with different amounts of resources allocated to them, generically labelled small, medium, and large. I've been running load tests against the runners to make sure we'll be able to have multiple workflows running at the same time, but I've noticed that some runners randomly receive a SIGTERM and begin to terminate. Specifically, this always happens when I run a large number of jobs (at least 20) at once.

From what I can tell, the controller sends a termination signal to a runner that is still being used by a job. The controller logs that the runner couldn't be removed because a job is still running, but after waiting a while it starts terminating the pod anyway. The runner has a grace period of 120 seconds before fully terminating, during which some jobs are able to complete, but the majority are still running. At that point the job gets cancelled and I see the following message:
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
In the controller logs, I see:
```
2023-07-25T19:53:55Z DEBUG runnerpod Failed to unregister runner before deleting the pod. {"runnerpod": "actions-runner-system/gha-runnerset-large-qrcp2-0", "error": "failed to remove runner: DELETE https://<github-url>/api/v3/enterprises/.../actions/runners/5722: 422 Bad request - Runner \"gha-runnerset-large-qrcp2-0\" is still running a job\" []"}
```
Then it shows the following message a couple of times:
```
2023-07-25T19:53:55Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/gha-runnerset-large-qrcp2-0"}
```
Finally, this:
```
2023-07-25T19:55:45Z INFO runnerpod Runner pod has been stopped with a successful status. {"runnerpod": "actions-runner-system/gha-runnerset-large-qrcp2-0"}
2023-07-25T19:56:41Z INFO runnerpod Failed to delete pod within 1m0s. This is typically the case when a Kubernetes node became unreachable and the kube controller started evicting nodes. Forcefully deleting the pod to not get stuck. {"runnerpod": "actions-runner-system/gha-runnerset-large-qrcp2-0", "podDeletionTimestamp": "2023-07-25 19:53:52 +0000 UTC", "currentTime": "2023-07-25T19:56:41Z", "configuredDeletionTimeout": "1m0s"}
```
Exactly a minute later, the pod gets deleted.
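For context, the large RunnerSet is shaped roughly like the sketch below. This is a simplified illustration rather than the exact config: the enterprise name, replica count, resource sizes, and the RUNNER_GRACEFUL_STOP_TIMEOUT value are placeholders, and the real file is attached further down as runnerset.txt. The 120-second grace period mentioned above corresponds to terminationGracePeriodSeconds on the pod template.

```yaml
# Simplified sketch of the "large" RunnerSet; the real config has more fields
# (image, storage, node selectors, etc.) and may differ in details.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: gha-runnerset-large
  namespace: actions-runner-system
spec:
  enterprise: my-enterprise              # placeholder; runners register at the enterprise level on GHES
  labels:
    - large                              # generic size label targeted by workflows via runs-on
  replicas: 5                            # placeholder count
  # StatefulSet-style fields required by the RunnerSet CRD
  selector:
    matchLabels:
      app: gha-runnerset-large
  serviceName: gha-runnerset-large
  template:
    metadata:
      labels:
        app: gha-runnerset-large
    spec:
      terminationGracePeriodSeconds: 120 # the 120-second grace period described above
      containers:
        - name: runner
          env:
            - name: RUNNER_GRACEFUL_STOP_TIMEOUT
              value: "110"               # placeholder; time the runner gets to wrap up before being killed
          resources:
            requests:
              cpu: "4"                   # placeholder sizing for a "large" runner
              memory: 8Gi
```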
I've attached the workflow I use for testing, as well as the runnerset config:
runnerset.txt
workflow.txt
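The test workflow is essentially a fan-out of 20+ parallel jobs that run long enough to overlap; it looks roughly like the sketch below. The job count, labels, and sleep duration here are illustrative only, and the real file is the workflow.txt attachment above.

```yaml
# Simplified sketch of the load-test workflow; see workflow.txt for the real one.
name: runner-load-test
on: workflow_dispatch

jobs:
  load:
    strategy:
      fail-fast: false
      matrix:
        index: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                11, 12, 13, 14, 15, 16, 17, 18, 19, 20]  # at least 20 concurrent jobs
    runs-on: [self-hosted, large]   # targets the "large" runners by label
    steps:
      - name: Simulate a long-running job
        run: sleep 600              # keeps the job busy so pods are still mid-job when they get terminated
```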
Does anyone know how to find out why the runners are getting terminated in the middle of the job in the first place?