
Some runner pods are never terminated #3903

Open
4 tasks done
julien-michaud opened this issue Jan 30, 2025 · 3 comments
Labels
bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)

Comments

julien-michaud commented Jan 30, 2025

Controller Version

0.10.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue, and I'm sure it's not due to any recently introduced backward-incompatible changes.

To Reproduce

Start workflow jobs.

Describe the bug

Sometimes, the runner pods continue running in zombie mode after completing their jobs.
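
For illustration, one way to spot these from outside is to list runner pods ordered by age; long-lived pods in a pool of ephemeral runners stand out. The arc-runners namespace below is an assumption, adjust to your install:

  # list runner pods, oldest first; ephemeral runners should normally be short-lived
  kubectl get pods -n arc-runners --sort-by=.metadata.creationTimestamp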

Describe the expected behavior

Runner pods should be terminated after job completion.

Additional Context

gha-runner-scale-set-controller:
  enabled: true
  flags:
    logLevel: "warn"
  podLabels:
    finops.company.net/cloud_provider: gcp
    finops.company.net/cost_center: compute
    finops.company.net/product: tools
    finops.company.net/service: actions-runner-controller
    finops.company.net/region: europe-west1
  replicaCount: 3
  podAnnotations:
    ad.datadoghq.com/manager.checks: |
      {
        "openmetrics": {
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8080/metrics",
              "histogram_buckets_as_distributions": true,
              "namespace": "actions-runner-system",
              "metrics": [".*"]
            }
          ]
        }
      }
  metrics:
    controllerManagerAddr: ":8080"
    listenerAddr: ":8080"
    listenerEndpoint: "/metrics"

gha-runner-scale-set:
  enabled: true
  githubConfigUrl: https://github.com/company
  githubConfigSecret:
    github_token: <path:secret/github_token/actions_runner_controller#token>

  maxRunners: 100
  minRunners: 1

  containerMode:
    type: "dind"  ## type can be set to dind or kubernetes

  listenerTemplate:
    metadata:
      labels:
        finops.company.net/cloud_provider: gcp
        finops.company.net/cost_center: compute
        finops.company.net/product: tools
        finops.company.net/service: actions-runner-controller
        finops.company.net/region: europe-west1
      annotations:
        ad.datadoghq.com/listener.checks: |
          {
            "openmetrics": {
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:8080/metrics",
                  "histogram_buckets_as_distributions": true,
                  "namespace": "actions-runner-system",
                  "max_returned_metrics": 6000,
                  "metrics": [".*"],
                  "exclude_metrics": [
                    "gha_job_startup_duration_seconds",
                    "gha_job_execution_duration_seconds"
                  ],
                  "exclude_labels": [
                    "enterprise",
                    "event_name",
                    "job_name",
                    "job_result",
                    "job_workflow_ref",
                    "organization",
                    "repository",
                    "runner_name"
                  ]
                }
              ]
            }
          }
    spec:
      containers:
      - name: listener
        securityContext:
          runAsUser: 1000
  template:
    metadata:
      labels:
        finops.company.net/cloud_provider: gcp
        finops.company.net/cost_center: compute
        finops.company.net/product: tools
        finops.company.net/service: actions-runner-controller
        finops.company.net/region: europe-west1
    spec:
      restartPolicy: OnFailure
      imagePullSecrets:
        - name: company-prod-registry
      containers:
        - name: runner
          image: eu.gcr.io/company-production/devex/gha-runners:v1.0.0-snapshot5
          command: ["/home/runner/run.sh"]

  controllerServiceAccount:
    namespace: actions-runner-system
    name: actions-runner-controller-gha-rs-controller

Controller Logs

Date,Host,Service,Message
"2025-01-29T15:16:06.017Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:52.677Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:52.671Z","""node_name""","""manager""","Updated ephemeral runner status with pod phase"
"2025-01-29T15:15:52.657Z","""node_name""","""manager""","Updating ephemeral runner status with pod phase"
"2025-01-29T15:15:52.657Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:51.652Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:49.690Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:48.461Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:48.456Z","""node_name""","""manager""","Updated ephemeral runner status with pod phase"
"2025-01-29T15:15:48.440Z","""node_name""","""manager""","Updating ephemeral runner status with pod phase"
"2025-01-29T15:15:48.440Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:48.424Z","""node_name""","""manager""","Waiting for runner container status to be available"
"2025-01-29T15:15:48.399Z","""node_name""","""manager""","Created ephemeral runner pod"
"2025-01-29T15:15:48.367Z","""node_name""","""manager""","Created new pod spec for ephemeral runner"
"2025-01-29T15:15:48.366Z","""node_name""","""manager""","Creating new pod for ephemeral runner"
"2025-01-29T15:15:48.366Z","""node_name""","""manager""","Creating new EphemeralRunner pod."
"2025-01-29T15:15:48.361Z","""node_name""","""manager""","Created ephemeral runner secret"
"2025-01-29T15:15:48.313Z","""node_name""","""manager""","Created new secret spec for ephemeral runner"
"2025-01-29T15:15:48.313Z","""node_name""","""manager""","Creating new secret for ephemeral runner"
"2025-01-29T15:15:48.313Z","""node_name""","""manager""","Creating new ephemeral runner secret for jitconfig."
"2025-01-29T15:15:48.308Z","""node_name""","""manager""","Updated ephemeral runner status with runnerId and runnerJITConfig"
"2025-01-29T15:15:48.294Z","""node_name""","""manager""","Updating ephemeral runner status with runnerId and runnerJITConfig"
"2025-01-29T15:15:48.294Z","""node_name""","""manager""","Created ephemeral runner JIT config"
"2025-01-29T15:15:48.093Z","""node_name""","""manager""","Creating ephemeral runner JIT config"
"2025-01-29T15:15:48.093Z","""node_name""","""manager""","Creating new ephemeral runner registration and updating status with runner config"
"2025-01-29T15:15:48.093Z","""node_name""","""manager""","Successfully added runner registration finalizer"
"2025-01-29T15:15:48.076Z","""node_name""","""manager""","Adding runner registration finalizer"
"2025-01-29T15:15:48.076Z","""node_name""","""manager""","Successfully added finalizer"
"2025-01-29T15:15:48.059Z","""node_name""","""manager""","Adding finalizer"

Runner Pod Logs

https://gist.github.com/julien-michaud/ce2a1e5c5d494d89e09453f0b270a26f

AblionGE commented

Hi @julien-michaud,

I'm seeing the same behaviour (I'm using containerMode kubernetes).

I checked the processes on one of these instances: the steps were done (no more processes from the workflow), but the container just stays there doing nothing.

I hit this especially with long-running commands that don't write anything to the output: a terraform plan with a resource that is "huge" to compute (4-5 minutes locally, without writing anything to the output), or a terraform/terragrunt validate on a "big" repository.

prizov commented Jan 31, 2025

Hi @julien-michaud 👋
We discovered an issue in our environment with exactly the same symptoms as yours, although we run different versions of the runner and the controller. It turned out that some processes (node applications) inside the runner container got stuck in the D state (uninterruptible sleep), so the runner pod was never terminated properly.
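
For reference, one way to check a suspect pod for D-state processes; the pod name and namespace below are placeholders, while "runner" matches the container name from the config above:

  # show any processes stuck in uninterruptible sleep (STAT contains "D")
  kubectl exec -n arc-runners <runner-pod> -c runner -- ps -eo pid,stat,comm | awk '$2 ~ /D/'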

We reached out to GCP support, and they confirmed a regression introduced with the Container-Optimized OS (COS) versions between cos-113-18244-236-26 and cos-113-18244-236-70.

Here is what they suggested:

I would like to let you know that a regression was introduced in Container-Optimized OS (COS)
versions between cos-113-18244-236-26 and cos-113-18244-236-70. This was later confirmed to be
related to the io_uring system calls. The current COS version 'cos-113-18244-236-70' in your
environment is also affected by this.

This issue has been forwarded to the product team, and the team suggested upgrading to a newer
COS version (e.g., 1.30.9-gke.1009000 (cos-113-18244-291-3) or later) containing the fix.

Downgrading to node pool version 1.30.6-gke.1596000 (COS 113-18244-236-26) successfully mitigated
the issue for one of the customers. For smoother testing, could you start a new node pool on this
version, 1.30.6-gke.1596000 (COS 113-18244-236-26), and schedule one of the pods there?

The long-term solution would be upgrading to a newer COS version (e.g., 1.30.9-gke.1009000
(cos-113-18244-291-3)).

Based on the configuration you shared, I assume you're also running the runners on GKE. I hope this helps!
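
If you want to verify the node image and roll out the fix, something like the following should work; the cluster, node pool, and zone names are placeholders:

  # show which COS image each node is running
  kubectl get nodes -o custom-columns=NAME:.metadata.name,OS-IMAGE:.status.nodeInfo.osImage

  # upgrade the runner node pool to a GKE version carrying the fixed COS image
  gcloud container clusters upgrade <cluster> --node-pool=<pool> --cluster-version=1.30.9-gke.1009000 --zone=<zone>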

julien-michaud (Author) commented

> [quotes @prizov's comment above in full]

Thanks a lot for the info, @prizov!

We just upgraded to cos-113-18244-236-77 🤞
