
Scale down cilium-operator before deleting a cluster (only in eni mode) #1033

Merged
merged 4 commits into main from cilium-cleanup on Feb 13, 2025

Conversation

iuriaranda
Contributor

What this PR does / why we need it

This is an attempt to prevent a race condition on cluster deletion where Cilium could leak ENIs, preventing the full removal of the VPC.

Towards https://github.com/giantswarm/giantswarm/issues/32350
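
The mechanism is visible in the rendered template diff further down: in ENI mode, a Helm pre-delete hook Job fetches the workload cluster kubeconfig and scales the cilium-operator deployment to zero before the cluster resources are removed. Condensed to its essential command (taken from the rendered job.yaml below):

    kubectl --kubeconfig /tmp/${CLUSTER}-kubeconfig scale --namespace kube-system deployment cilium-operator --replicas=0

With the operator stopped, ENIs stay attached to their EC2 instances and are deleted by AWS on instance termination, instead of being detached by Cilium and then orphaned.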

Checklist

  • Updated CHANGELOG.md.

Trigger E2E tests

/run cluster-test-suites

@iuriaranda iuriaranda requested a review from a team as a code owner February 12, 2025 15:16
tinkerers-ci bot commented Feb 12, 2025

cluster-test-suites

Run name: pr-cluster-aws-1033-cluster-test-suitesq7kft
Commit SHA: 7a0ce65
Result: Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run cluster-test-suites


Tip

To re-run only the failed test suites, provide a TARGET_SUITES parameter with your trigger that points to the directory path of the test suites to run, e.g. /run cluster-test-suites TARGET_SUITES=./providers/capa/standard to re-run the CAPA standard test suite. Multiple test suites are supported, with each path separated by a comma.
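
For example, to re-run two suites in one trigger (the second path is hypothetical, shown only to illustrate the comma-separated form):

    /run cluster-test-suites TARGET_SUITES=./providers/capa/standard,./providers/capa/upgrade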

iuriaranda and others added 2 commits February 12, 2025 18:28
…tate

e.g.
- if the kubeconfig secret does not exist
- if the workload cluster API is inaccessible
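
This graceful handling is part of the rendered hook script shown in the template diff below; a condensed sketch of the pattern:

    # Try to fetch the workload cluster kubeconfig; if that fails, assume the
    # cluster is already deleted and exit successfully so the hook never blocks deletion.
    kubectl get secret -n ${NAMESPACE} ${CLUSTER}-kubeconfig -o jsonpath='{.data.value}' | base64 -d > /tmp/${CLUSTER}-kubeconfig
    if [ ! -f /tmp/${CLUSTER}-kubeconfig ]; then
      echo "Failed to fetch kubeconfig for cluster ${CLUSTER}. Assuming it is already deleted."
      exit 0
    fi
    # ...scale down cilium-operator on the workload cluster...
    exit 0 # Ignore previous exit code, we don't want to block app deletion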
@iuriaranda
Contributor Author

/run cluster-test-suites TARGET_SUITES=./providers/capa/standard

tinkerers-ci bot commented Feb 13, 2025

cluster-test-suites

Run name: pr-cluster-aws-1033-cluster-test-suitesh44vv
Commit SHA: 5b21cb5
Result: Completed ✅

📋 View full results in Tekton Dashboard

Rerun trigger:
/run cluster-test-suites


Tip

To re-run only the failed test suites, provide a TARGET_SUITES parameter with your trigger that points to the directory path of the test suites to run, e.g. /run cluster-test-suites TARGET_SUITES=./providers/capa/standard to re-run the CAPA standard test suite. Multiple test suites are supported, with each path separated by a comma.

Contributor

There were differences in the rendered Helm template, please check! ⚠️

Output
=== Differences when rendered with values file helm/cluster-aws/ci/test-auditd-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-eni-mode-values.yaml ===

(file level)
    ---
    # Source: cluster-aws/templates/cilium-cleanup/serviceaccount.yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: test-wc-cilium-cleanup
      namespace: org-giantswarm
      annotations:
        helm.sh/hook: pre-delete
        helm.sh/hook-delete-policy: "before-hook-creation,hook-succeeded,hook-failed"
        helm.sh/hook-weight: "-1"
      labels:
        app: cluster-aws
        app.kubernetes.io/managed-by: Helm
        cluster.x-k8s.io/cluster-name: test-wc
        giantswarm.io/cluster: test-wc
        giantswarm.io/organization: test
        cluster.x-k8s.io/watch-filter: capi
        helm.sh/chart: cluster-aws-2.6.1
        application.giantswarm.io/team: phoenix
        release.giantswarm.io/version: 29.1.0
    # Source: cluster-aws/templates/cilium-cleanup/role.yaml
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: test-wc-cilium-cleanup
      namespace: org-giantswarm
      annotations:
        helm.sh/hook: pre-delete
        helm.sh/hook-delete-policy: "before-hook-creation,hook-succeeded,hook-failed"
        helm.sh/hook-weight: "-1"
      labels:
        app: cluster-aws
        app.kubernetes.io/managed-by: Helm
        cluster.x-k8s.io/cluster-name: test-wc
        giantswarm.io/cluster: test-wc
        giantswarm.io/organization: test
        cluster.x-k8s.io/watch-filter: capi
        helm.sh/chart: cluster-aws-2.6.1
        application.giantswarm.io/team: phoenix
        release.giantswarm.io/version: 29.1.0
    rules:
    - apiGroups:
      - ""
      resources:
      - secrets
      verbs:
      - get
      resourceNames:
      - test-wc-kubeconfig
    # Source: cluster-aws/templates/cilium-cleanup/rolebinding.yaml
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: test-wc-cilium-cleanup
      namespace: org-giantswarm
      annotations:
        helm.sh/hook: pre-delete
        helm.sh/hook-delete-policy: "before-hook-creation,hook-succeeded,hook-failed"
        helm.sh/hook-weight: "-1"
      labels:
        app: cluster-aws
        app.kubernetes.io/managed-by: Helm
        cluster.x-k8s.io/cluster-name: test-wc
        giantswarm.io/cluster: test-wc
        giantswarm.io/organization: test
        cluster.x-k8s.io/watch-filter: capi
        helm.sh/chart: cluster-aws-2.6.1
        application.giantswarm.io/team: phoenix
        release.giantswarm.io/version: 29.1.0
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: test-wc-cilium-cleanup
    subjects:
    - kind: ServiceAccount
      name: test-wc-cilium-cleanup
      namespace: org-giantswarm
    # Source: cluster-aws/templates/cilium-cleanup/job.yaml
    #
    # When working in eni mode and ENIs need to be deleted (e.g. on cluster deletion), Cilium first needs to detach
    # the ENIs from the EC2 instance, and then delete them.
    # If cilium-operator gets killed / removed after it detaches an ENI but before it deletes it, the ENI will be orphaned.
    #
    # This is a possible scenario on cluster deletion. To prevent it, this pre-delete hook will scale down cilium-operator before deleting.
    # This way ENIs remain attached to the EC2 instances and get automatically deleted by AWS upon instance termination.
    #
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: test-wc-cilium-cleanup
      namespace: org-giantswarm
      annotations:
        helm.sh/hook: pre-delete
        helm.sh/hook-delete-policy: "before-hook-creation,hook-succeeded"
        helm.sh/hook-weight: 0
      labels:
        app: cluster-aws
        app.kubernetes.io/managed-by: Helm
        cluster.x-k8s.io/cluster-name: test-wc
        giantswarm.io/cluster: test-wc
        giantswarm.io/organization: test
        cluster.x-k8s.io/watch-filter: capi
        helm.sh/chart: cluster-aws-2.6.1
        application.giantswarm.io/team: phoenix
        release.giantswarm.io/version: 29.1.0
    spec:
      template:
        metadata:
          labels:
            app: cluster-aws
            app.kubernetes.io/managed-by: Helm
            cluster.x-k8s.io/cluster-name: test-wc
            giantswarm.io/cluster: test-wc
            giantswarm.io/organization: test
            cluster.x-k8s.io/watch-filter: capi
            helm.sh/chart: cluster-aws-2.6.1
            application.giantswarm.io/team: phoenix
            release.giantswarm.io/version: 29.1.0
        spec:
          serviceAccountName: test-wc-cilium-cleanup
          containers:
          - name: kubectl
            image: "gsoci.azurecr.io/giantswarm/kubectl:1.25.16"
            securityContext:
              runAsNonRoot: true
              runAsUser: 1000
              runAsGroup: 1000
              allowPrivilegeEscalation: false
              seccompProfile:
                type: RuntimeDefault
              capabilities:
                drop:
                - ALL
            env:
            - name: CLUSTER
              value: test-wc
            - name: NAMESPACE
              value: org-giantswarm
            command:
            - /bin/sh
            args:
            - "-c"
            - |
              echo "Fetching ${CLUSTER} kubeconfig..."
              kubectl get secret -n ${NAMESPACE} ${CLUSTER}-kubeconfig -o jsonpath='{.data.value}' | base64 -d > /tmp/${CLUSTER}-kubeconfig
              if [ ! -f /tmp/${CLUSTER}-kubeconfig ]; then
                echo "Failed to fetch kubeconfig for cluster ${CLUSTER}. Assuming it is already deleted."
                exit 0
              fi
              
              echo "Scaling down cilium-operator on cluster ${CLUSTER} as it is about to be deleted..."
              kubectl --kubeconfig /tmp/${CLUSTER}-kubeconfig scale --namespace kube-system deployment cilium-operator --replicas=0
              
              exit 0 # Ignore previous exit code, we don't want to block app deletion
              
            resources:
              requests:
                cpu: 10m
                memory: 64Mi
              limits:
                cpu: 100m
                memory: 256Mi
          restartPolicy: Never
      ttlSecondsAfterFinished: 86400 # 24h
    
  



=== Differences when rendered with values file helm/cluster-aws/ci/test-lifecycle-hook-heartbeattimeout-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-local-registry-cache-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-mc-proxy-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-multiple-authenticated-mirrors-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-multiple-service-account-issuers-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-network-topology-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-spot-instances-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-subnet-tags-values.yaml ===

No difference


=== Differences when rendered with values file helm/cluster-aws/ci/test-wc-minimal-values.yaml ===

No difference

@iuriaranda iuriaranda added the skip/ci label (Instructs PR Gatekeeper to ignore any required PR checks) on Feb 13, 2025
@iuriaranda
Contributor Author

Skipping the tests as they already ran before merging to main

@iuriaranda iuriaranda merged commit 4d4c391 into main Feb 13, 2025
10 checks passed
@iuriaranda iuriaranda deleted the cilium-cleanup branch February 13, 2025 09:57