Some runners are never terminated correctly #3901

Open
4 tasks done
julien-michaud opened this issue Jan 29, 2025 · 1 comment
Labels

  • bug: Something isn't working
  • gha-runner-scale-set: Related to the gha-runner-scale-set mode
  • needs triage: Requires review from the maintainers

julien-michaud commented Jan 29, 2025

Controller Version

0.10.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes.

To Reproduce

Install the 0.10.1 controller and start some jobs.

Describe the bug

A few runner pods never finish terminating. Their pod events show repeated FailedKillPod warnings:

Events:
  Type     Reason         Age                  From     Message
  ----     ------         ----                 ----     -------
  Normal   Killing        36m                  kubelet  Stopping container dind
  Warning  FailedKillPod  18m (x2 over 22m)    kubelet  error killing pod: [failed to "KillContainer" for "runner" with KillContainerError: "rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"cfdc4b7bdc85e7ac5233ffa784edb90ff49407a6fa73a7279bf629acf0d6319c\" to be killed: wait container \"cfdc4b7bdc85e7ac5233ffa784edb90ff49407a6fa73a7279bf629acf0d6319c\": context deadline exceeded", failed to "KillPodSandbox" for "48d6b275-ff6b-41bb-afd2-dc7a8c092bb0" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = failed to stop container \"cfdc4b7bdc85e7ac5233ffa784edb90ff49407a6fa73a7279bf629acf0d6319c\": an error occurs during waiting for container \"cfdc4b7bdc85e7ac5233ffa784edb90ff49407a6fa73a7279bf629acf0d6319c\" to be killed: wait container \"cfdc4b7bdc85e7ac5233ffa784edb90ff49407a6fa73a7279bf629acf0d6319c\": context deadline exceeded"]
  Warning  FailedKillPod  13m                  kubelet  error killing pod: [failed to "KillContainer" for "runner" with KillContainerError: "rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"cfdc4b7bdc85e7ac5233ffa784edb90ff49407a6fa73a7279bf629acf0d6319c\" to be killed: wait container \"cfdc4b7bdc85e7ac5233ffa784edb90ff49407a6fa73a7279bf629acf0d6319c\": context deadline exceeded", failed to "KillPodSandbox" for "48d6b275-ff6b-41bb-afd2-dc7a8c092bb0" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
  Warning  FailedKillPod  4m42s (x4 over 31m)  kubelet  error killing pod: [failed to "KillContainer" for "runner" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "48d6b275-ff6b-41bb-afd2-dc7a8c092bb0" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
  Warning  FailedKillPod  11s                  kubelet  error killing pod: [failed to "KillContainer" for "runner" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "48d6b275-ff6b-41bb-afd2-dc7a8c092bb0" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = failed to stop container \"cfdc4b7bdc85e7ac5233ffa784edb90ff49407a6fa73a7279bf629acf0d6319c\": an error occurs during waiting for container \"cfdc4b7bdc85e7ac5233ffa784edb90ff49407a6fa73a7279bf629acf0d6319c\" to be killed: wait container \"cfdc4b7bdc85e7ac5233ffa784edb90ff49407a6fa73a7279bf629acf0d6319c\": context deadline exceeded"]
  Normal   Killing        10s (x9 over 36m)    kubelet  Stopping container runner

Has anyone else encountered this?

Describe the expected behavior

Runners are terminated normally.

Additional Context

GKE 1.30.8
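
To check whether other pods are in the same state, something like the following could flag pods whose deletion has been pending for a long time. This is only a minimal sketch: the function name `stuck_terminating` and the 10-minute threshold are assumptions for illustration, and the input is assumed to be the parsed JSON produced by `kubectl get pods -o json`.

```python
from datetime import datetime, timedelta, timezone


def stuck_terminating(pod_list, now, grace=timedelta(minutes=10)):
    """Return names of pods whose deletionTimestamp is older than `grace`.

    `pod_list` is the parsed JSON of a Kubernetes PodList.
    The function name and the default threshold are hypothetical.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        ts = meta.get("deletionTimestamp")
        if ts is None:
            continue  # deletion has not been requested for this pod
        # Kubernetes serializes timestamps as RFC 3339 in UTC,
        # e.g. 2025-01-29T10:00:00Z
        deleted_at = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        )
        if now - deleted_at > grace:
            stuck.append(meta.get("name"))
    return stuck
```

For example, the output of `kubectl get pods -n gha-runner-scale-set-blue -o json` could be saved to a file, loaded with `json.load`, and passed in together with `datetime.now(timezone.utc)` to list candidate pods.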

Controller Logs

/

Runner Pod Logs

/
julien-michaud added the bug, gha-runner-scale-set, and needs triage labels Jan 29, 2025

julien-michaud (Author) commented

apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  annotations:
    actions.github.com/cleanup-github-secret-name: company-hosted-gha-rs-github-secret
    actions.github.com/cleanup-manager-role-binding: company-hosted-gha-rs-manager
    actions.github.com/cleanup-manager-role-name: company-hosted-gha-rs-manager
    actions.github.com/cleanup-no-permission-service-account-name: company-hosted-gha-rs-no-permission
    actions.github.com/runner-group-name: company-hosted-blue
    actions.github.com/runner-scale-set-name: company-hosted
    actions.github.com/values-hash: 4c53ac934cbd22098dc59b28eaa7bcb3f8ee8076185610a84de3239b705bd91
    argocd.argoproj.io/tracking-id: >-
      actions-runner-controller-gke-live-labs-europe-west1-runner-scale-set-blue:actions.github.com/AutoscalingRunnerSet:gha-runner-scale-set-blue/company-hosted
    runner-scale-set-id: '762'
  creationTimestamp: '2025-01-08T16:00:42Z'
  finalizers:
    - autoscalingrunnerset.actions.github.com/finalizer
  generation: 2
  labels:
    actions.github.com/organization: company
    actions.github.com/scale-set-name: company-hosted
    actions.github.com/scale-set-namespace: gha-runner-scale-set-blue
    app.kubernetes.io/component: autoscaling-runner-set
    app.kubernetes.io/instance: company-hosted
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: company-hosted
    app.kubernetes.io/part-of: gha-rs
    app.kubernetes.io/version: 0.10.1
    helm.sh/chart: gha-rs-0.10.1
  name: company-hosted
  namespace: gha-runner-scale-set-blue
  resourceVersion: '1631900357'
  uid: 3a109c65-1c27-4e19-9823-55684a50cf4f
spec:
  githubConfigSecret: company-hosted-gha-rs-github-secret
  githubConfigUrl: https://github.com/company
  listenerTemplate:
    metadata:
      annotations:
        ad.datadoghq.com/listener.checks: |
          {
            "openmetrics": {
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:8080/metrics",
                  "histogram_buckets_as_distributions": true,
                  "namespace": "actions-runner-system",
                  "max_returned_metrics": 6000,
                  "metrics": [".*"],
                  "exclude_metrics": [
                    "gha_job_startup_duration_seconds",
                    "gha_job_execution_duration_seconds"
                  ],
                  "exclude_labels": [
                    "enterprise",
                    "event_name",
                    "job_name",
                    "job_result",
                    "job_workflow_ref",
                    "organization",
                    "repository",
                    "runner_name"
                  ]
                }
              ]
            }
          }
        logs.company.com/datadog_source: gha-runner-scale-set
      labels:
        finops.company.net/cloud_provider: gcp
        finops.company.net/cluster: gke-live-labs-europe-west1
        finops.company.net/cost_center: compute
        finops.company.net/product: tools
        finops.company.net/region: europe-west1
        finops.company.net/service: actions-runner-controller
        finops.company.net/service_class: live
        finops.company.net/stage: prod
    spec:
      containers:
        - name: listener
          securityContext:
            runAsUser: 1000
  maxRunners: 100
  minRunners: 1
  runnerGroup: company-hosted-blue
  runnerScaleSetName: company-hosted
  template:
    metadata:
      annotations:
        logs.company.com/datadog_source: gha-runner-scale-set
      labels:
        finops.company.net/cloud_provider: gcp
        finops.company.net/cluster: gke-live-labs-europe-west1
        finops.company.net/cost_center: compute
        finops.company.net/product: tools
        finops.company.net/region: europe-west1
        finops.company.net/service: actions-runner-controller
        finops.company.net/service_class: live
        finops.company.net/stage: prod
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node_pool
                    operator: In
                    values:
                      - github-actions
      containers:
        - command:
            - /home/runner/run.sh
          env:
            - name: DOCKER_HOST
              value: unix:///var/run/docker.sock
            - name: RUNNER_WAIT_FOR_DOCKER_IN_SECONDS
              value: '120'
          image: >-
            europe-docker.pkg.dev/platform-zzzz/company-prod/devex/gha-runners:v0.1.25
          name: runner
          resources:
            requests:
              cpu: 4
          volumeMounts:
            - mountPath: /home/runner/_work
              name: work
            - mountPath: /var/run
              name: dind-sock
        - args:
            - dockerd
            - '--host=unix:///var/run/docker.sock'
            - '--group=$(DOCKER_GROUP_GID)'
          env:
            - name: DOCKER_GROUP_GID
              value: '123'
          image: docker:dind
          name: dind
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /home/runner/_work
              name: work
            - mountPath: /var/run
              name: dind-sock
            - mountPath: /home/runner/externals
              name: dind-externals
      imagePullSecrets:
        - name: company-prod-registry
      initContainers:
        - args:
            - '-r'
            - '-v'
            - /home/runner/externals/.
            - /home/runner/tmpDir/
          command:
            - cp
          image: >-
            europe-docker.pkg.dev/platform-zzzz/company-prod/devex/gha-runners:v0.1.25
          name: init-dind-externals
          volumeMounts:
            - mountPath: /home/runner/tmpDir
              name: dind-externals
      restartPolicy: OnFailure
      serviceAccountName: company-hosted-gha-rs-no-permission
      tolerations:
        - effect: NoSchedule
          key: github-actions
          operator: Equal
          value: 'true'
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: runner
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      volumes:
        - emptyDir: {}
          name: dind-sock
        - emptyDir: {}
          name: dind-externals
        - emptyDir: {}
          name: work
status:
  currentRunners: 1
  pendingEphemeralRunners: 1
  runningEphemeralRunners: 0
