
Some runner pods are never terminated #3903

Open
4 tasks done
julien-michaud opened this issue Jan 30, 2025 · 3 comments
Labels
bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)

Comments

julien-michaud commented Jan 30, 2025

Controller Version

0.10.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue, and I'm sure it's not due to any recently introduced backward-incompatible changes.

To Reproduce

Start workflow jobs.

Describe the bug

Sometimes, the runner pods continue running in zombie mode after completing their jobs.
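
For illustration, one way to spot these from outside is to list runner pods ordered by age; long-lived pods in a pool of ephemeral runners stand out. The arc-runners namespace below is an assumption, adjust to your install:

  # list runner pods, oldest first; ephemeral runners should normally be short-lived
  kubectl get pods -n arc-runners --sort-by=.metadata.creationTimestamp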

Describe the expected behavior

Runner pods should be terminated after job completion.

Additional Context

gha-runner-scale-set-controller:
  enabled: true
  flags:
    logLevel: "warn"
  podLabels:
    finops.company.net/cloud_provider: gcp
    finops.company.net/cost_center: compute
    finops.company.net/product: tools
    finops.company.net/service: actions-runner-controller
    finops.company.net/region: europe-west1
  replicaCount: 3
  podAnnotations:
    ad.datadoghq.com/manager.checks: |
      {
        "openmetrics": {
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8080/metrics",
              "histogram_buckets_as_distributions": true,
              "namespace": "actions-runner-system",
              "metrics": [".*"]
            }
          ]
        }
      }
  metrics:
    controllerManagerAddr: ":8080"
    listenerAddr: ":8080"
    listenerEndpoint: "/metrics"

gha-runner-scale-set:
  enabled: true
  githubConfigUrl: https://github.com/company
  githubConfigSecret:
    github_token: <path:secret/github_token/actions_runner_controller#token>

  maxRunners: 100
  minRunners: 1

  containerMode:
    type: "dind"  ## type can be set to dind or kubernetes

  listenerTemplate:
    metadata:
      labels:
        finops.company.net/cloud_provider: gcp
        finops.company.net/cost_center: compute
        finops.company.net/product: tools
        finops.company.net/service: actions-runner-controller
        finops.company.net/region: europe-west1
      annotations:
        ad.datadoghq.com/listener.checks: |
          {
            "openmetrics": {
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:8080/metrics",
                  "histogram_buckets_as_distributions": true,
                  "namespace": "actions-runner-system",
                  "max_returned_metrics": 6000,
                  "metrics": [".*"],
                  "exclude_metrics": [
                    "gha_job_startup_duration_seconds",
                    "gha_job_execution_duration_seconds"
                  ],
                  "exclude_labels": [
                    "enterprise",
                    "event_name",
                    "job_name",
                    "job_result",
                    "job_workflow_ref",
                    "organization",
                    "repository",
                    "runner_name"
                  ]
                }
              ]
            }
          }
    spec:
      containers:
      - name: listener
        securityContext:
          runAsUser: 1000
  template:
    metadata:
      labels:
        finops.company.net/cloud_provider: gcp
        finops.company.net/cost_center: compute
        finops.company.net/product: tools
        finops.company.net/service: actions-runner-controller
        finops.company.net/region: europe-west1
    spec:
      restartPolicy: OnFailure
      imagePullSecrets:
        - name: company-prod-registry
      containers:
        - name: runner
          image: eu.gcr.io/company-production/devex/gha-runners:v1.0.0-snapshot5
          command: ["/home/runner/run.sh"]

  controllerServiceAccount:
    namespace: actions-runner-system
    name: actions-runner-controller-gha-rs-controller

Controller Logs

Date,Host,Service,Message
"2025-01-29T15:16:06.017Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:52.677Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:52.671Z","""node_name""","""manager""","Updated ephemeral runner status with pod phase"
"2025-01-29T15:15:52.657Z","""node_name""","""manager""","Updating ephemeral runner status with pod phase"
"2025-01-29T15:15:52.657Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:51.652Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:49.690Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:48.461Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:48.456Z","""node_name""","""manager""","Updated ephemeral runner status with pod phase"
"2025-01-29T15:15:48.440Z","""node_name""","""manager""","Updating ephemeral runner status with pod phase"
"2025-01-29T15:15:48.440Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:48.424Z","""node_name""","""manager""","Waiting for runner container status to be available"
"2025-01-29T15:15:48.399Z","""node_name""","""manager""","Created ephemeral runner pod"
"2025-01-29T15:15:48.367Z","""node_name""","""manager""","Created new pod spec for ephemeral runner"
"2025-01-29T15:15:48.366Z","""node_name""","""manager""","Creating new pod for ephemeral runner"
"2025-01-29T15:15:48.366Z","""node_name""","""manager""","Creating new EphemeralRunner pod."
"2025-01-29T15:15:48.361Z","""node_name""","""manager""","Created ephemeral runner secret"
"2025-01-29T15:15:48.313Z","""node_name""","""manager""","Created new secret spec for ephemeral runner"
"2025-01-29T15:15:48.313Z","""node_name""","""manager""","Creating new secret for ephemeral runner"
"2025-01-29T15:15:48.313Z","""node_name""","""manager""","Creating new ephemeral runner secret for jitconfig."
"2025-01-29T15:15:48.308Z","""node_name""","""manager""","Updated ephemeral runner status with runnerId and runnerJITConfig"
"2025-01-29T15:15:48.294Z","""node_name""","""manager""","Updating ephemeral runner status with runnerId and runnerJITConfig"
"2025-01-29T15:15:48.294Z","""node_name""","""manager""","Created ephemeral runner JIT config"
"2025-01-29T15:15:48.093Z","""node_name""","""manager""","Creating ephemeral runner JIT config"
"2025-01-29T15:15:48.093Z","""node_name""","""manager""","Creating new ephemeral runner registration and updating status with runner config"
"2025-01-29T15:15:48.093Z","""node_name""","""manager""","Successfully added runner registration finalizer"
"2025-01-29T15:15:48.076Z","""node_name""","""manager""","Adding runner registration finalizer"
"2025-01-29T15:15:48.076Z","""node_name""","""manager""","Successfully added finalizer"
"2025-01-29T15:15:48.059Z","""node_name""","""manager""","Adding finalizer"

Runner Pod Logs

https://gist.github.com/julien-michaud/ce2a1e5c5d494d89e09453f0b270a26f

AblionGE commented

Hi @julien-michaud,

I'm seeing the same behaviour (I'm using containerMode kubernetes).

I checked the processes on one of these instances: the steps were done (no more processes from the workflow), but the container just stays there doing nothing.

I hit this especially with long-running commands that don't write anything to the output: a terraform plan with a resource that is "huge" to compute (4-5 minutes locally, without writing anything to the output), or a terraform/terragrunt validate on a "big" repository.

prizov commented Jan 31, 2025

Hi @julien-michaud 👋
We discovered an issue in our environment with exactly the same symptoms as yours, although we run different versions of the runner and the controller. It turned out that some processes (node applications) inside the runner container got stuck in the D state (uninterruptible sleep), so the runner pod was never terminated properly.
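
For reference, one way to check a suspect pod for D-state processes; the pod name and namespace below are placeholders, while "runner" matches the container name from the config above:

  # show any processes stuck in uninterruptible sleep (STAT contains "D")
  kubectl exec -n arc-runners <runner-pod> -c runner -- ps -eo pid,stat,comm | awk '$2 ~ /D/'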

We reached out to GCP support, and they confirmed a regression introduced with the Container-Optimized OS (COS) versions between cos-113-18244-236-26 and cos-113-18244-236-70.

Here is what they suggested:

I would like to let you know that a regression was introduced in Container-Optimized OS (COS)
versions between cos-113-18244-236-26 and cos-113-18244-236-70. This was later confirmed to be
related to the io_uring system calls. The current COS version 'cos-113-18244-236-70' in your
environment is also affected by this.

This issue has been forwarded to the product team, and the team suggested upgrading to a newer
COS version (e.g., 1.30.9-gke.1009000 (cos-113-18244-291-3) or later) containing the fix.

Downgrading to node pool version 1.30.6-gke.1596000 (COS 113-18244-236-26) successfully mitigated
the issue for one of the customers. For smoother testing, could you start a new node pool on this
version, 1.30.6-gke.1596000 (COS 113-18244-236-26), and schedule one of the pods there?

The long-term solution would be upgrading to a newer COS version (e.g., 1.30.9-gke.1009000
(cos-113-18244-291-3)).

Based on the configuration you shared, I assume you're also running the runners on GKE. I hope this helps!
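
If you want to verify the node image and roll out the fix, something like the following should work; the cluster, node pool, and zone names are placeholders:

  # show which COS image each node is running
  kubectl get nodes -o custom-columns=NAME:.metadata.name,OS-IMAGE:.status.nodeInfo.osImage

  # upgrade the runner node pool to a GKE version carrying the fixed COS image
  gcloud container clusters upgrade <cluster> --node-pool=<pool> --cluster-version=1.30.9-gke.1009000 --zone=<zone>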

julien-michaud (Author) commented

> [quotes @prizov's comment above in full]

Thanks a lot for the info, @prizov!

We just upgraded to cos-113-18244-236-77 🤞
