
Consolidation does not happen even when there is a cheaper combination of instances available #1962

Open
codeeong opened this issue Feb 5, 2025 · 1 comment
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.


codeeong commented Feb 5, 2025

Description

Observed Behavior:
For context, we wanted to leave the cost-effectiveness decision to Karpenter, so we allowed a variety of instance families (c5a, c6a, m6a, m5a, c7a, r6a, r5a, r4) in large/xlarge/2xlarge sizes, expecting that the different CPU:memory combinations would let Karpenter make the best usage-to-cost decisions on our behalf.

However, based on the nodes Karpenter chose for us, our memory utilization is good at around 90%, while CPU usage is very low (around 50%). For instance, we have many c5a.xlarge instances (in the same AZ) that use less than 50% CPU. Two of these could be consolidated into a cheaper m6a.xlarge, which has double the memory and the same CPU, but the event on the node says:

  Normal  Unconsolidatable  4m36s (x47 over 15h)  karpenter  Can't replace with a cheaper node
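For reference, a back-of-envelope sketch of the resource arithmetic behind this claim. The instance sizes are the published AWS specs; the per-node request totals are hypothetical illustrations, since Karpenter's consolidation simulation works from pod resource requests rather than observed utilization:

  # Could the pods on two lightly loaded c5a.xlarge nodes (4 vCPU / 8 GiB each)
  # fit on one m6a.xlarge (4 vCPU / 16 GiB)? Specs are from AWS docs; the
  # request totals below are made-up illustrations, not measured values.
  M6A_XLARGE = {"vcpu": 4, "memory_gib": 16}

  # Hypothetical summed pod requests per c5a.xlarge node (~45% CPU, ~90% memory).
  node_requests = [
      {"vcpu": 1.8, "memory_gib": 7.0},
      {"vcpu": 1.7, "memory_gib": 7.2},
  ]

  total_vcpu = sum(n["vcpu"] for n in node_requests)       # 3.5 vCPU
  total_mem = sum(n["memory_gib"] for n in node_requests)  # 14.2 GiB

  # Ignores daemonsets and kubelet/system reserved capacity, which the real
  # scheduling simulation does account for.
  fits = total_vcpu <= M6A_XLARGE["vcpu"] and total_mem <= M6A_XLARGE["memory_gib"]
  print(fits)  # True with these numbers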

Instance types with CPU usage look like this:
[Screenshot: instance types with CPU usage]

This ends up being more expensive than our original pre-provisioned node pool (which ran at around 60-65% utilization for both CPU and memory).
To alleviate the issue, we have removed certain instance types from the list in our NodePool configuration. However, we are curious whether this is the expected behavior, because if so, users still have to work out which specific subset of instance types fits their cluster's resource needs before Karpenter can be used to minimise costs.

Expected Behavior:
We expect to see multi-node consolidation, which is defined as:

Multi Node Consolidation - Try to delete two or more nodes in parallel, possibly launching a single replacement whose price is lower than that of all nodes being removed

For instance, we would expect to see two c5a.xlarge instances consolidated into one m6a.xlarge, since the CPU and memory would fit on that instance and it would cost less.
[Screenshot]
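A minimal sketch of that decision rule as we read it (this is not Karpenter's actual implementation; the prices and request figures are placeholders purely for illustration):

  from dataclasses import dataclass

  @dataclass
  class Node:
      instance_type: str
      price_per_hour: float        # placeholder prices, not real AWS pricing
      vcpu: float
      memory_gib: float
      requested_vcpu: float        # sum of pod requests currently on the node
      requested_memory_gib: float

  def can_consolidate(candidates: list[Node], replacement: Node) -> bool:
      """Multi-node consolidation as quoted above: one replacement must be cheaper
      than all candidate nodes combined, and their pods must still fit on it."""
      total_vcpu = sum(n.requested_vcpu for n in candidates)
      total_mem = sum(n.requested_memory_gib for n in candidates)
      fits = total_vcpu <= replacement.vcpu and total_mem <= replacement.memory_gib
      cheaper = replacement.price_per_hour < sum(n.price_per_hour for n in candidates)
      return fits and cheaper

  # Two half-idle c5a.xlarge vs one m6a.xlarge, with made-up prices.
  c5a_nodes = [Node("c5a.xlarge", 0.15, 4, 8, 1.8, 7.0),
               Node("c5a.xlarge", 0.15, 4, 8, 1.7, 7.2)]
  m6a = Node("m6a.xlarge", 0.17, 4, 16, 0, 0)
  print(can_consolidate(c5a_nodes, m6a))  # True with these placeholder numbers

With these placeholder numbers both conditions pass, which is why the Unconsolidatable event above was surprising to us.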

Reproduction Steps (Please include YAML):
nodepool config:

  "object": {
      "apiVersion": "karpenter.sh/v1",
      "kind": "NodePool",
      "metadata": {
        "annotations": {
          "karpenter.sh/nodepool-hash": "10589712261218411145",
          "karpenter.sh/nodepool-hash-version": "v3"
        },
        "creationTimestamp": null,
        "deletionGracePeriodSeconds": null,
        "deletionTimestamp": null,
        "finalizers": null,
        "generateName": null,
        "generation": null,
        "labels": null,
        "managedFields": null,
        "name": "node-pool-1",
        "namespace": null,
        "ownerReferences": null,
        "resourceVersion": null,
        "selfLink": null,
        "uid": null
      },
      "spec": {
        "disruption": {
          "budgets": [
            {
              "duration": null,
              "nodes": "5%",
              "reasons": null,
              "schedule": null
            },
          ],
          "consolidateAfter": "30s",
          "consolidationPolicy": "WhenEmptyOrUnderutilized"
        },
        "limits": {
          "cpu": "140",
          "memory": "1000Gi"
        },
        "template": {
          "metadata": {
            "annotations": null,
            "labels": null
          },
          "spec": {
            "expireAfter": "Never",
            "nodeClassRef": {
              "group": "karpenter.k8s.aws",
              "kind": "EC2NodeClass",
              "name": "node-pool-1"
            },
            "requirements": [
              {
                "key": "node.kubernetes.io/instance-type",
                "minValues": null,
                "operator": "In",
                "values": [
                  "c5a.xlarge",
                  "c5a.2xlarge",
                  "c6a.xlarge",
                  "c6a.2xlarge",
                  "c7a.xlarge",
                  "c7a.2xlarge",
                  "m5a.xlarge",
                  "m5a.2xlarge",
                  "m6a.xlarge",
                  "m6a.2xlarge",
                  "r4.xlarge",
                  "r4.2xlarge",
                  "r5a.xlarge",
                  "r5a.2xlarge",
                  "r6a.xlarge",
                  "r6a.2xlarge"
                ]
              },
              {
                "key": "karpenter.sh/capacity-type",
                "minValues": null,
                "operator": "NotIn",
                "values": [
                  "spot"
                ]
              },
              {
                "key": "eks.amazonaws.com/capacityType",
                "minValues": null,
                "operator": "In",
                "values": [
                  "ON_DEMAND"
                ]
              },
              {
                "key": "topology.kubernetes.io/zone",
                "minValues": null,
                "operator": "In",
                "values": [
                  "ap-southeast-1a",
                  "ap-southeast-1b",
                  "ap-southeast-1c"
                ]
              }
            ],
            "startupTaints": null,
            "taints": null,
            "terminationGracePeriod": null
          }
        },
        "weight": null
      }
    },
    "timeouts": [],
    "wait": [],
    "wait_for": null
  }
}

Versions:

  • Chart Version: v1.1.1
  • Kubernetes Version (kubectl version): v1.30

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@codeeong added the kind/bug label on Feb 5, 2025
@k8s-ci-robot added the needs-triage label on Feb 5, 2025
@k8s-ci-robot (Contributor) commented:

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
