[BUG] IPs are not released when a node is gracefully shut down #550

Open
sprat opened this issue Jan 17, 2025 · 0 comments

Describe the bug

When a node is gracefully shut down, its IPs are not released immediately; they are released only after the node has rebooted.

Expected behavior

IPs should be released immediately so that the new pods spawned to replace the killed pods can acquire IP addresses.

To Reproduce

  1. I have enabled the graceful node shutdown feature (as described in https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown) with the following kubelet parameters:
     • shutdownGracePeriod: 60s
     • shutdownGracePeriodCriticalPods: 20s
  2. Then I trigger a node shutdown by running the reboot command on the node.
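
For reference, the kubelet splits shutdownGracePeriod between the two pod classes: critical pods get the final shutdownGracePeriodCriticalPods, and regular pods get whatever remains before that. A minimal sketch of that arithmetic with the values above (the function name is hypothetical, used only for illustration):

```python
def split_grace_period(shutdown_grace_period_s: int, critical_pods_s: int) -> tuple[int, int]:
    """Return (seconds reserved for regular pods, seconds reserved for critical pods)."""
    if critical_pods_s > shutdown_grace_period_s:
        raise ValueError("shutdownGracePeriodCriticalPods must not exceed shutdownGracePeriod")
    # Critical pods get the last part of the window; regular pods get the rest.
    return shutdown_grace_period_s - critical_pods_s, critical_pods_s


# Values from the kubelet parameters in step 1.
regular_s, critical_s = split_grace_period(60, 20)
print(f"regular pods: {regular_s}s, critical pods: {critical_s}s")
```

With these settings, the test pods (which are not critical) should have up to 40 seconds to terminate before the critical pods are stopped.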

With the graceful shutdown feature, the pods on the node are killed (they end up in the Completed state) but are not deleted by Kubernetes: I guess that's why the IP addresses are not released. I've deliberately limited the number of IP addresses in the pool to demonstrate the problem: the new pods can't acquire IP addresses.
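
To check this guess, one could look for pods that have terminated but still exist in the API while requesting a secondary network. The sketch below is only an illustration, not part of Whereabouts: it assumes the input has the shape of `kubectl get pods -A -o json` output, that terminated pods show up with phase Succeeded or Failed, and the sample pod names are made up.

```python
TERMINAL_PHASES = {"Succeeded", "Failed"}
NETWORKS_ANNOTATION = "k8s.v1.cni.cncf.io/networks"


def terminated_pods_with_secondary_network(pod_list: dict) -> list[str]:
    """Return 'namespace/name' for terminated pods that still exist and request a secondary network."""
    found = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        annotations = meta.get("annotations") or {}
        phase = pod.get("status", {}).get("phase")
        if phase in TERMINAL_PHASES and NETWORKS_ANNOTATION in annotations:
            found.append(f"{meta.get('namespace', 'default')}/{meta.get('name')}")
    return found


# Made-up sample in the shape of a Kubernetes PodList (names are hypothetical).
sample = {
    "items": [
        {
            "metadata": {
                "name": "test-sriov-a",
                "namespace": "default",
                "annotations": {NETWORKS_ANNOTATION: "sriov-net-1"},
            },
            "status": {"phase": "Succeeded"},
        },
        {
            "metadata": {
                "name": "test-sriov-b",
                "namespace": "default",
                "annotations": {NETWORKS_ANNOTATION: "sriov-net-1"},
            },
            "status": {"phase": "Running"},
        },
    ]
}

print(terminated_pods_with_secondary_network(sample))
```

On a real cluster, the same function could be fed `json.load(...)` of the kubectl output; any pod it reports is a candidate for still holding an address in the pool.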

Environment:

  • Whereabouts version: v0.8.0
  • Kubernetes version (use kubectl version): v1.30.2
  • Network-attachment-definition (preceded by the test Deployment that uses it):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-sriov
  labels:
    app: test-sriov
spec:
  replicas: 8
  selector:
    matchLabels:
      app: test-sriov
  template:
    metadata:
      labels:
        app: test-sriov
      annotations:
        k8s.v1.cni.cncf.io/networks: sriov-net-1
    spec:
      containers:
      - name: main
        image: nginx:latest
        resources:
          requests:
            intel.com/sriov_net_1: "1"
          limits:
            intel.com/sriov_net_1: "1"
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net-1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: "intel.com/sriov_net_1"
spec:
  config: '{
    "cniVersion": "0.3.1",
    "name": "sriov_net_1",
    "type": "sriov",
    "spoofchk": "off",
    "trust": "on",
    "ipam": {
      "type": "whereabouts",
      "range": "172.29.144.0/24",
      "range_start": "172.29.144.200",
      "range_end": "172.29.144.208"
    }
  }'
  • Whereabouts configuration (on the host):
{
  "datastore": "kubernetes",
  "kubernetes": {
    "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
  },
  "reconciler_cron_expression": "30 4 * * *"
}
  • OS (e.g. from /etc/os-release): AlmaLinux 9.4
  • Kernel (e.g. uname -a): 5.14.0-427.13.1.el9_4.x86_64
  • Others: N/A
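
To make the pool limit concrete, here is a small sketch that counts the addresses between range_start and range_end from the NetworkAttachmentDefinition above. It assumes both ends of the range are inclusive and that no addresses are excluded; the helper name is hypothetical.

```python
import ipaddress


def pool_size(range_start: str, range_end: str) -> int:
    """Count addresses from range_start to range_end, assuming both ends are inclusive."""
    start = ipaddress.ip_address(range_start)
    end = ipaddress.ip_address(range_end)
    if end < start:
        raise ValueError("range_end must not be lower than range_start")
    return int(end) - int(start) + 1


# Values from the "ipam" section and the Deployment's replica count.
size = pool_size("172.29.144.200", "172.29.144.208")
replicas = 8
print(f"pool size: {size}, spare addresses with {replicas} replicas: {size - replicas}")
```

Under these assumptions the pool holds 9 addresses for 8 replicas, so as soon as more than one pod has to be replaced while the old allocations are still held, the replacements cannot get an IP.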

Additional info / context

Occasionally, other CNI pods (e.g. calico, multus) fail to restart immediately after the reboot, which causes even more problems.
