[BUG] IPs are not released when a node is gracefully shut down #550

sprat · 2025-01-17T14:15:11Z

Describe the bug

When a node is gracefully shut down, the IPs of its pods are not released immediately; they are released only after the node has rebooted.

Expected behavior

IPs should be released immediately so that the new pods spawned to replace the killed pods can acquire IP addresses.

To Reproduce

I have enabled the graceful node shutdown feature (as described in https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown) with the following kubelet parameters:

shutdownGracePeriod: 60s
shutdownGracePeriodCriticalPods: 20s

Then I trigger a node shutdown by launching the reboot command on the node.

With the graceful shutdown feature, the pods on the node are killed (they end up in the Completed state) but are not deleted by Kubernetes: I guess that's why the IP addresses are not released. I've deliberately limited the number of IP addresses in the pool to demonstrate the problem: the new pods can't acquire IP addresses.
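To check which pods are in this state, the output of `kubectl get pods -o json` can be filtered for pods that have reached a terminal phase (`Succeeded` or `Failed`) but still exist and still reference the secondary network. The following is a minimal sketch, not part of Whereabouts: the function name and the sample pod names are hypothetical, and it only handles the short comma-separated form of the `k8s.v1.cni.cncf.io/networks` annotation.

```python
# Terminal pod phases in Kubernetes; such pods still exist as API objects.
TERMINAL_PHASES = {"Succeeded", "Failed"}
NETWORKS_ANNOTATION = "k8s.v1.cni.cncf.io/networks"


def terminated_pods_with_network(pod_list: dict, network: str) -> list[str]:
    """Return names of terminated pods that still reference `network`.

    `pod_list` is the parsed JSON of `kubectl get pods -o json`.
    Assumption: the annotation uses the comma-separated short form.
    """
    names = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        annotations = meta.get("annotations") or {}
        phase = pod.get("status", {}).get("phase")
        raw = annotations.get(NETWORKS_ANNOTATION, "")
        networks = [n.strip() for n in raw.split(",") if n.strip()]
        if phase in TERMINAL_PHASES and network in networks:
            names.append(meta.get("name", ""))
    return names


# Hypothetical sample data shaped like `kubectl get pods -o json` output.
sample = {
    "items": [
        {
            "metadata": {
                "name": "test-sriov-aaa",
                "annotations": {NETWORKS_ANNOTATION: "sriov-net-1"},
            },
            "status": {"phase": "Succeeded"},
        },
        {
            "metadata": {
                "name": "test-sriov-bbb",
                "annotations": {NETWORKS_ANNOTATION: "sriov-net-1"},
            },
            "status": {"phase": "Running"},
        },
        {
            "metadata": {"name": "other-pod"},
            "status": {"phase": "Failed"},
        },
    ]
}

stale = terminated_pods_with_network(sample, "sriov-net-1")
print(stale)
```

With real data, the dictionary would come from `json.load` on the saved `kubectl` output instead of the inline sample.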

Environment:

Whereabouts version: v0.8.0
Kubernetes version (use kubectl version): v1.30.2
Network-attachment-definition (preceded by the test Deployment that uses it):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-sriov
  labels:
    app: test-sriov
spec:
  replicas: 8
  selector:
    matchLabels:
      app: test-sriov
  template:
    metadata:
      labels:
        app: test-sriov
      annotations:
        k8s.v1.cni.cncf.io/networks: sriov-net-1
    spec:
      containers:
      - name: main
        image: nginx:latest
        resources:
          requests:
            intel.com/sriov_net_1: "1"
          limits:
            intel.com/sriov_net_1: "1"
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net-1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: "intel.com/sriov_net_1"
spec:
  config: '{
    "cniVersion": "0.3.1",
    "name": "sriov_net_1",
    "type": "sriov",
    "spoofchk": "off",
    "trust": "on",
    "ipam": {
      "type": "whereabouts",
      "range": "172.29.144.0/24",
      "range_start": "172.29.144.200",
      "range_end": "172.29.144.208"
    }
  }'
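To make the pool limit concrete, the size of the configured range can be compared with the replica count of the Deployment. This is a small sketch that assumes both `range_start` and `range_end` are inclusive; the variable names are illustrative only.

```python
import ipaddress

# Values copied from the NetworkAttachmentDefinition and Deployment above.
range_start = ipaddress.ip_address("172.29.144.200")
range_end = ipaddress.ip_address("172.29.144.208")
replicas = 8

# Assumption: both bounds of the range are inclusive.
pool_size = int(range_end) - int(range_start) + 1
spare = pool_size - replicas

print(f"pool size: {pool_size}, replicas: {replicas}, spare: {spare}")
```

Under that assumption the pool holds 9 addresses for 8 replicas, so as soon as terminated pods on the shut-down node keep more than one address, the replacement pods cannot all be allocated one.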

Whereabouts configuration (on the host):

{
  "datastore": "kubernetes",
  "kubernetes": {
    "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
  },
  "reconciler_cron_expression": "30 4 * * *"
}
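The `reconciler_cron_expression` value `30 4 * * *` is a standard five-field cron expression for a run once a day at 04:30. The sketch below computes the next run for such a daily expression using only the standard library; the helper name is hypothetical, it handles only the `M H * * *` form, and it ignores time zones. Whether a reconciler run would actually release the addresses of pods that are terminated but not deleted is not established here.

```python
from datetime import datetime, timedelta


def next_daily_run(expr: str, now: datetime) -> datetime:
    """Next run time for a daily cron expression of the form 'M H * * *'."""
    minute, hour, dom, month, dow = expr.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("this sketch only handles daily 'M H * * *' expressions")
    candidate = now.replace(
        hour=int(hour), minute=int(minute), second=0, microsecond=0
    )
    # If today's slot has already passed, the next run is tomorrow.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


# Illustrative "now": the timestamp of this report, treated as local time.
now = datetime(2025, 1, 17, 14, 15, 11)
next_run = next_daily_run("30 4 * * *", now)
print(next_run, "in", next_run - now)
```

With a once-a-day schedule, any cleanup that depends on the reconciler can be up to 24 hours away.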

OS (e.g. from /etc/os-release): AlmaLinux 9.4
Kernel (e.g. uname -a): 5.14.0-427.13.1.el9_4.x86_64
Others: N/A

Additional info / context

Sometimes other CNI pods (e.g. calico, multus) fail to restart immediately after the reboot, which causes even more problems.

sprat mentioned this issue Jan 17, 2025

Duplicate IP addresses at scale: Possible read/write locking problem? #110

Open