
Scale down cilium-operator before deleting a cluster (only in eni mode) #1033

Merged
merged 4 commits into main from cilium-cleanup on Feb 13, 2025

Conversation

iuriaranda
Contributor

What this PR does / why we need it

This is an attempt to prevent a race condition on cluster deletion where Cilium could leak ENIs, preventing the full removal of the VPC.

Towards https://github.com/giantswarm/giantswarm/issues/32350
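
The mechanism is visible in the rendered template diff further down: in ENI mode, a Helm pre-delete hook Job fetches the workload cluster kubeconfig and scales the cilium-operator deployment to zero before the cluster resources are removed. Condensed to its essential command (taken from the rendered job.yaml below):

    kubectl --kubeconfig /tmp/${CLUSTER}-kubeconfig scale --namespace kube-system deployment cilium-operator --replicas=0

With the operator stopped, ENIs stay attached to their EC2 instances and are deleted by AWS on instance termination, instead of being detached by Cilium and then orphaned.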

Checklist

  • Updated CHANGELOG.md.

Trigger E2E tests

/run cluster-test-suites

@iuriaranda iuriaranda requested a review from a team as a code owner February 12, 2025 15:16
tinkerers-ci bot commented Feb 12, 2025

cluster-test-suites

Run name: pr-cluster-aws-1033-cluster-test-suitesq7kft
Commit SHA: 7a0ce65
Result: Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run cluster-test-suites


Tip

To re-run only the failed test suites, provide a TARGET_SUITES parameter with your trigger that points to the directory path of the test suites to run, e.g. /run cluster-test-suites TARGET_SUITES=./providers/capa/standard to re-run the CAPA standard test suite. Multiple test suites are supported, with each path separated by a comma.
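
For example, to re-run two suites in one trigger (the second path is hypothetical, shown only to illustrate the comma-separated form):

    /run cluster-test-suites TARGET_SUITES=./providers/capa/standard,./providers/capa/upgrade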

iuriaranda and others added 2 commits February 12, 2025 18:28
…tate

e.g.
- if the kubeconfig secret does not exist
- if the workload cluster API is inaccessible
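
This graceful handling is part of the rendered hook script shown in the template diff below; a condensed sketch of the pattern:

    # Try to fetch the workload cluster kubeconfig; if that fails, assume the
    # cluster is already deleted and exit successfully so the hook never blocks deletion.
    kubectl get secret -n ${NAMESPACE} ${CLUSTER}-kubeconfig -o jsonpath='{.data.value}' | base64 -d > /tmp/${CLUSTER}-kubeconfig
    if [ ! -f /tmp/${CLUSTER}-kubeconfig ]; then
      echo "Failed to fetch kubeconfig for cluster ${CLUSTER}. Assuming it is already deleted."
      exit 0
    fi
    # ...scale down cilium-operator on the workload cluster...
    exit 0 # Ignore previous exit code, we don't want to block app deletion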
@iuriaranda
Contributor Author

/run cluster-test-suites TARGET_SUITES=./providers/capa/standard

tinkerers-ci bot commented Feb 13, 2025

cluster-test-suites

Run name: pr-cluster-aws-1033-cluster-test-suitesh44vv
Commit SHA: 5b21cb5
Result: Completed ✅

📋 View full results in Tekton Dashboard

Rerun trigger:
/run cluster-test-suites


Tip

To re-run only the failed test suites, provide a TARGET_SUITES parameter with your trigger that points to the directory path of the test suites to run, e.g. /run cluster-test-suites TARGET_SUITES=./providers/capa/standard to re-run the CAPA standard test suite. Multiple test suites are supported, with each path separated by a comma.

Contributor

There were differences in the rendered Helm template, please check! ⚠️

Output
=== Differences when rendered with values file helm/cluster-aws/ci/test-auditd-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-eni-mode-values.yaml ===

(file level)
    ---
    # Source: cluster-aws/templates/cilium-cleanup/serviceaccount.yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: test-wc-cilium-cleanup
      namespace: org-giantswarm
      annotations:
        helm.sh/hook: pre-delete
        helm.sh/hook-delete-policy: "before-hook-creation,hook-succeeded,hook-failed"
        helm.sh/hook-weight: "-1"
      labels:
        app: cluster-aws
        app.kubernetes.io/managed-by: Helm
        cluster.x-k8s.io/cluster-name: test-wc
        giantswarm.io/cluster: test-wc
        giantswarm.io/organization: test
        cluster.x-k8s.io/watch-filter: capi
        helm.sh/chart: cluster-aws-2.6.1
        application.giantswarm.io/team: phoenix
        release.giantswarm.io/version: 29.1.0
    # Source: cluster-aws/templates/cilium-cleanup/role.yaml
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: test-wc-cilium-cleanup
      namespace: org-giantswarm
      annotations:
        helm.sh/hook: pre-delete
        helm.sh/hook-delete-policy: "before-hook-creation,hook-succeeded,hook-failed"
        helm.sh/hook-weight: "-1"
      labels:
        app: cluster-aws
        app.kubernetes.io/managed-by: Helm
        cluster.x-k8s.io/cluster-name: test-wc
        giantswarm.io/cluster: test-wc
        giantswarm.io/organization: test
        cluster.x-k8s.io/watch-filter: capi
        helm.sh/chart: cluster-aws-2.6.1
        application.giantswarm.io/team: phoenix
        release.giantswarm.io/version: 29.1.0
    rules:
    - apiGroups:
      - ""
      resources:
      - secrets
      verbs:
      - get
      resourceNames:
      - test-wc-kubeconfig
    # Source: cluster-aws/templates/cilium-cleanup/rolebinding.yaml
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: test-wc-cilium-cleanup
      namespace: org-giantswarm
      annotations:
        helm.sh/hook: pre-delete
        helm.sh/hook-delete-policy: "before-hook-creation,hook-succeeded,hook-failed"
        helm.sh/hook-weight: "-1"
      labels:
        app: cluster-aws
        app.kubernetes.io/managed-by: Helm
        cluster.x-k8s.io/cluster-name: test-wc
        giantswarm.io/cluster: test-wc
        giantswarm.io/organization: test
        cluster.x-k8s.io/watch-filter: capi
        helm.sh/chart: cluster-aws-2.6.1
        application.giantswarm.io/team: phoenix
        release.giantswarm.io/version: 29.1.0
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: test-wc-cilium-cleanup
    subjects:
    - kind: ServiceAccount
      name: test-wc-cilium-cleanup
      namespace: org-giantswarm
    # Source: cluster-aws/templates/cilium-cleanup/job.yaml
    #
    # When working in eni mode and ENIs need to be deleted (e.g. on cluster deletion), Cilium first needs to detach
    # the ENIs from the EC2 instance, and then delete them.
    # If cilium-operator gets killed / removed after it detaches an ENI but before it deletes it, the ENI will be orphaned.
    #
    # This is a possible scenario on cluster deletion. To prevent it, this pre-delete hook will scale down cilium-operator before deleting.
    # This way ENIs remain attached to the EC2 instances and get automatically deleted by AWS upon instance termination.
    #
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: test-wc-cilium-cleanup
      namespace: org-giantswarm
      annotations:
        helm.sh/hook: pre-delete
        helm.sh/hook-delete-policy: "before-hook-creation,hook-succeeded"
        helm.sh/hook-weight: 0
      labels:
        app: cluster-aws
        app.kubernetes.io/managed-by: Helm
        cluster.x-k8s.io/cluster-name: test-wc
        giantswarm.io/cluster: test-wc
        giantswarm.io/organization: test
        cluster.x-k8s.io/watch-filter: capi
        helm.sh/chart: cluster-aws-2.6.1
        application.giantswarm.io/team: phoenix
        release.giantswarm.io/version: 29.1.0
    spec:
      template:
        metadata:
          labels:
            app: cluster-aws
            app.kubernetes.io/managed-by: Helm
            cluster.x-k8s.io/cluster-name: test-wc
            giantswarm.io/cluster: test-wc
            giantswarm.io/organization: test
            cluster.x-k8s.io/watch-filter: capi
            helm.sh/chart: cluster-aws-2.6.1
            application.giantswarm.io/team: phoenix
            release.giantswarm.io/version: 29.1.0
        spec:
          serviceAccountName: test-wc-cilium-cleanup
          containers:
          - name: kubectl
            image: "gsoci.azurecr.io/giantswarm/kubectl:1.25.16"
            securityContext:
              runAsNonRoot: true
              runAsUser: 1000
              runAsGroup: 1000
              allowPrivilegeEscalation: false
              seccompProfile:
                type: RuntimeDefault
              capabilities:
                drop:
                - ALL
            env:
            - name: CLUSTER
              value: test-wc
            - name: NAMESPACE
              value: org-giantswarm
            command:
            - /bin/sh
            args:
            - "-c"
            - |
              echo "Fetching ${CLUSTER} kubeconfig..."
              kubectl get secret -n ${NAMESPACE} ${CLUSTER}-kubeconfig -o jsonpath='{.data.value}' | base64 -d > /tmp/${CLUSTER}-kubeconfig
              if [ ! -f /tmp/${CLUSTER}-kubeconfig ]; then
                echo "Failed to fetch kubeconfig for cluster ${CLUSTER}. Assuming it is already deleted."
                exit 0
              fi
              
              echo "Scaling down cilium-operator on cluster ${CLUSTER} as it is about to be deleted..."
              kubectl --kubeconfig /tmp/${CLUSTER}-kubeconfig scale --namespace kube-system deployment cilium-operator --replicas=0
              
              exit 0 # Ignore previous exit code, we don't want to block app deletion
              
            resources:
              requests:
                cpu: 10m
                memory: 64Mi
              limits:
                cpu: 100m
                memory: 256Mi
          restartPolicy: Never
      ttlSecondsAfterFinished: 86400 # 24h
    
  



=== Differences when rendered with values file helm/cluster-aws/ci/test-lifecycle-hook-heartbeattimeout-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-local-registry-cache-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-mc-proxy-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-multiple-authenticated-mirrors-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-multiple-service-account-issuers-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-network-topology-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-spot-instances-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-subnet-tags-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-wc-minimal-values.yaml ===

No difference

@iuriaranda iuriaranda added the skip/ci label (Instructs PR Gatekeeper to ignore any required PR checks) on Feb 13, 2025
@iuriaranda
Contributor Author

Skipping the tests as they already ran before merging to main

@iuriaranda iuriaranda merged commit 4d4c391 into main Feb 13, 2025
10 checks passed
@iuriaranda iuriaranda deleted the cilium-cleanup branch February 13, 2025 09:57