AAW Dev: Review node selection logic #2003

Jose-Matsuda · 2024-12-11T15:23:01Z

EPIC
Are there any pods running on, say, the system nodepool that shouldn't be there? Are there any daemonsets deploying to nodes where they are not needed?
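One way to start on the first question is to filter the output of `kubectl get pods -A -o json` by node name. This is a minimal sketch, not a finished tool: the helper `pods_on_nodes`, the node names, and the sample data are all hypothetical, and it relies only on the `metadata.namespace`, `metadata.name`, and `spec.nodeName` fields of the Pod objects.

```python
def pods_on_nodes(pod_list, node_names):
    """Return sorted (namespace, name, node) tuples for pods scheduled on node_names."""
    wanted = set(node_names)
    found = []
    for pod in pod_list.get("items", []):
        # Pods that are still pending have no nodeName yet, so .get() returns None.
        node = pod.get("spec", {}).get("nodeName")
        if node in wanted:
            meta = pod["metadata"]
            found.append((meta["namespace"], meta["name"], node))
    return sorted(found)


# Hypothetical sample shaped like `kubectl get pods -A -o json` output.
sample = {
    "items": [
        {"metadata": {"namespace": "kube-system", "name": "kube-proxy-abc"},
         "spec": {"nodeName": "system-node-0"}},
        {"metadata": {"namespace": "user-ns", "name": "notebook-xyz"},
         "spec": {"nodeName": "system-node-0"}},
        {"metadata": {"namespace": "user-ns", "name": "notebook-2"},
         "spec": {"nodeName": "user-node-1"}},
    ]
}

for namespace, name, node in pods_on_nodes(sample, ["system-node-0"]):
    print(f"{node}: {namespace}/{name}")
```

With real data, the dictionary would come from `json.load(sys.stdin)` with the kubectl output piped in. Deciding which of the listed pods actually belong on the system nodepool still has to be done by hand.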


Souheil-Yazji · 2025-01-08T16:03:31Z

@jacek-dudek please update ticket with details on what you will be doing & update the status of the ticket accordingly.

jacek-dudek · 2025-02-05T15:59:58Z

Starting with the list of daemonsets getting deployed on the cluster and getting a description of what each one does.

k9s-daemonsets-output.ods

jacek-dudek · 2025-02-10T15:29:42Z

I found short descriptions of what each daemonset does to help us assess if they should be running on the cluster or a particular nodepool:

aad-pod-identity-nmi
NMI stands for Node Managed Identity.
Part of the solution that assigns Azure Active Directory identities to pods so they can access cloud resources and services.

csi-blob-node
Component of blob-csi-driver. This driver allows Kubernetes to access Azure Blob Storage.

fluentd-operator-fluentd-operator
Part of fluent operator logging tool, a universal solution for Kubernetes cluster logging.

azure-ip-masq-agent
The ip-masq-agent configures iptables rules to handle masquerading node/pod IP addresses when sending traffic to destinations outside the cluster node's IP and the Cluster IP range. This essentially hides pod IP addresses behind the cluster node's IP address. In some environments, traffic to "external" addresses must come from a known machine address. For example, in Google Cloud, any traffic to the internet must come from a VM's IP. When containers are used, as in Google Kubernetes Engine, the Pod IP will be rejected for egress. To avoid this, we must hide the Pod IP behind the VM's own IP address - generally known as "masquerade".
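The masquerade decision described above can be modelled as a simple CIDR check. This is a simplified sketch for illustration only: the real agent programs iptables rules rather than evaluating addresses one by one, and the function name `should_masquerade` and the CIDR range below are assumptions, not values taken from our cluster.

```python
import ipaddress


def should_masquerade(destination, non_masquerade_cidrs):
    """Return True if traffic to `destination` would be hidden behind the node IP.

    Simplified model: destinations inside any of the non-masquerade CIDRs keep
    the pod IP as source; everything else is masqueraded.
    """
    dst = ipaddress.ip_address(destination)
    return not any(dst in ipaddress.ip_network(cidr) for cidr in non_masquerade_cidrs)


# Hypothetical range covering the cluster's internal addresses.
NON_MASQUERADE_CIDRS = ["10.0.0.0/8"]

print(should_masquerade("10.244.1.7", NON_MASQUERADE_CIDRS))    # False: stays as pod IP
print(should_masquerade("203.0.113.10", NON_MASQUERADE_CIDRS))  # True: leaves as node IP
```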

azure-npm
Azure Network Policy Manager. Implements Kubernetes network policies for communication between pods within a cluster and between pods and the outside world.

cloud-node-manager
Part of Azure cloud controller manager, which is Microsoft Azure's implementation of the Kubernetes cloud provider interface.

cloud-node-manager-windows
Probably the same as cloud-node-manager, but for nodes running Windows.

csi-azuredisk-node
Part of the Azure Disk CSI driver for Kubernetes. This driver allows Kubernetes to access Azure Disk volumes.

csi-azuredisk-node-win
Probably the same as csi-azuredisk-node, but for nodes running Windows.

csi-azurefile-node
Part of the Azure File CSI driver for Kubernetes. This driver allows Kubernetes to access Azure file shares using the SMB and NFS protocols.

csi-azurefile-node-win
Probably the same as csi-azurefile-node, but for nodes running Windows.

istio-cni-node
The Istio daemonset that implements the Kubernetes Container Network Interface (CNI). It is optional when the Istio mesh is implemented with sidecar containers but required when Istio runs in ambient mode.

kube-proxy
Part of core Kubernetes.

nvidia-device-plugin
A daemonset that implements the Kubernetes device plugin framework.
It exposes the number of GPUs on each node of the cluster, keeps track of GPU health, and makes it possible to run GPU-enabled containers.

kube-prometheus-stack-prometheus-node-exporter
Prometheus Node Exporter provides hardware and OS-level system metrics exposed by the kernel.
It measures the following metrics:
Memory: RAM Total, RAM Used, RAM Cache, RAM Free
Disk: Disk Space, IOPS, Mounts
CPU: CPU Load, CPU Memory Disk
Network: Network traffic, TCP flow, Connections
Node Exporter exposes metrics on ‘/metrics’ sub-path on port 9100.

sysctl
Not sure about this one.

jacek-dudek · 2025-02-19T14:55:00Z

In subsequent elab discussion it was confirmed that these are the changes to make:

Eliminate the following daemonsets from the cluster, as we do not run any Windows-based machines: cloud-node-manager-windows, csi-azuredisk-node-win, csi-azurefile-node-win.
Eliminate these daemonsets, as we don't use Azure file shares: csi-azurefile-node, csi-azurefile-node-win (the latter is also covered by the previous item).
Implement node selection logic for nvidia-device-plugin so that it runs only on nodepools whose machines have GPUs.
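The node selection change for nvidia-device-plugin could take the form of a `nodeSelector` on the daemonset's pod template. Below is a minimal sketch that builds such a patch. The helper `node_selector_patch` and the label key and value (`accelerator=nvidia`) are hypothetical; the label actually carried by our GPU nodepools would need to be confirmed first.

```python
import json


def node_selector_patch(label_key, label_value):
    """Build a patch that sets a nodeSelector on a DaemonSet's pod template."""
    return {
        "spec": {
            "template": {
                "spec": {
                    # Pods are scheduled only onto nodes carrying this label.
                    "nodeSelector": {label_key: label_value}
                }
            }
        }
    }


# Hypothetical label; replace with the one present on the GPU nodes.
patch = node_selector_patch("accelerator", "nvidia")
print(json.dumps(patch))
```

The printed JSON could be passed to `kubectl patch daemonset nvidia-device-plugin --patch '<json>'` in the daemonset's namespace. If the daemonset is deployed from a chart or manifest, the same `nodeSelector` would have to be set there as well, otherwise the next deployment would overwrite the patch.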

Jose-Matsuda assigned jacek-dudek Dec 11, 2024

Jose-Matsuda mentioned this issue Dec 11, 2024

AAW Dev: Resource Utilization #1998

Open

10 tasks

jacek-dudek mentioned this issue Feb 19, 2025

Remove unused daemonsets from aaw-dev cluster #2024

Closed

Souheil-Yazji closed this as completed Feb 19, 2025
