Massive scaling and anomalous behaviors with the new forceful disruption method #1928

jorgeperezc · 2025-01-24T13:27:34Z

Description

Observed Behavior:
The releases of v1 brings a gift of a new approach regarding the node lifecycle, with graceful and forceful methods for draining and expiration.

Regarding node expiration, according to the Nodepool schema specification, the NodeTerminationGracePeriod property acts as a feature flag to enable a maximum threshold for recycling. If defined as null, the expiration controller will wait indefinitely, respecting do-not-disrupt, PDBs and so on.(Note that I do not remember reading this in the documentation).

  terminationGracePeriod	<string>
    TerminationGracePeriod is the maximum duration the controller will wait
    before forcefully deleting the pods on a node, measured from when deletion
    is first initiated.
    
    [...]

    The feature can also be used to allow maximum time limits for long-running
    jobs which can delay node termination with preStop hooks.
    If left undefined, the controller will wait indefinitely for pods to be
    drained.

Having said that, two erratic behaviors can be observed:

Forceful method enabled. The terminationGracePeriod property is defined as the maximum grace period threshold for draining the node's pods. When the expiration of NodeClaims begins (TTL specified in the expireAfter setting), they are marked with the annotation karpenter.sh/nodeclaim-termination-timestamp, indicating the maximum datetime for decommissioning, and the grace period countdown starts. The affected node workloads, regardless of PDBs and the do-not-disrupt annotation, are identified by the provisioner controller as reschedulable pods, causing the scheduler to determine whether to generate a new NodeClaim as a replacement based on the available capacity. We have use cases with extended grace periods and workloads with significant sizing but the scheduler does not consider the potential maximum grace period, provisioning replacements that might not be used until the application terminates. Additionally, there are pods nominated to be scheduled on the newly provisioned NodeClaims, blocking possible disruptions, lack of synchronization of the cluster state with the in-memory snapshot of Karpenter and extensive enqueuing of reconciliations by the provisioner, creating the perfect ingredients for a disaster. I believe it does not make sense to flip between provisioning and consolidation with resources that may not be used, leading to unnecessary costs. For example, jobs with a TTL of 5 hours that could use the entire grace period but from t0 already have an unused replacement. Aggressive consolidation budgets tend to worsen the situation, leading to more chaos.
Forceful method disabled. The terminationGracePeriod property is left undefined, which generates behavior similar to previous releases of v1 where PDBs and do-not-disrupt annotations are respected, causing the controller to wait indefinitely for the expired NodeClaims workloads to be drained. There are scenarios where this behavior is desired to minimize maximum disruption. In this case, an anomalous behavior occurs similar to the one mentioned earlier, with the difference that pods that cannot be drained are identified as reschedulable pods, leading to the provisioning of new NodeClaims that will never be used. The same flipping behavior persists along with the possibility risk of massive, uncontrolled scaling.

In addition to everything already mentioned, we must also consider the entropy generated by Kubernetes controllers: HPA scaling, new deployments, cronjobs leading to a possible reset of the consolidateAfter setting, suspending potential disruptions and the incorrect sizing of Karpenter pods. As a result of this last point, It could lead to a conflict between goroutines from different controllers, leading to excessive context switching, which degrades performance. Certain controllers may end up consuming more CPU time, resulting in greater disparity from the expected behavior. I am not sure if this is addressed in the documentation but it would be valuable to outline best practices for users who are unaware of the runtime and code behavior to avoid poor performance or discrepancies in the actions performed by controllers.

Example events observed:

Events:
  Type    Reason       Age       From       Message
  ----    ------       ----      ----       -------
  Normal   Nominated   38m     karpenter    Pod should schedule on: nodeclaim/test-5tu1t
  Normal   Nominated   27m     karpenter    Pod should schedule on: nodeclaim/test-qqkw6
  Normal   Nominated   16m     karpenter    Pod should schedule on: nodeclaim/test-nshgw
  Normal   Nominated   5m44s   karpenter    Pod should schedule on: nodeclaim/test-0tgjd

Events:
  Type    Reason           Age                  From       Message
  ----    ------           ----                 ----       -------
Normal  DisruptionBlocked  14s (x4 over 4m17s)  karpenter  Cannot disrupt NodeClaim: state node is nominated for a pending pod

Expected Behavior:

Forceful method enabled (expiration controller): The provisioner controller, particularly the scheduler, should consider the maximum time it may take to drain workloads before creating a replacement NodeClaim. It should also account for the average time required to provision new nodes. For example, if a workload consumes its 3 hours grace period (similar to nodeTerminationGracePeriod) and the average provisioning time for new nodes, the scheduler will create new NodeClaims with enough time before the forced decomission. This ensures the new replacement capacity is available while balancing both costs and reliability.
Forceful method disabled (expiration controller). The controller will respect workloads with PDBs and do-not-disrupt annotations on expired NodeClaims. The provisioner (scheduler) should not identify these pods as reschedulable, preventing the generation of new NodeClaims that will never be used, thus avoiding unnecessary costs.

I have submitted an illustrative PR demonstrating the expected behavior. It’s likely that the code’s current placement is not ideal and should be moved to the expiration or lifecycle controllers. I compiled those modifications and tested them in a development environment. They appear stable although I’m unsure if they might impact any other functionality.

Let me know if there is anything else I can do to help, as this issue is having a significant impact on costs and preventing access to features in EKS 1.31 that are unsupported by earlier v1 releases.

Reproduction Steps (Please include YAML):

Forceful method enabled:
- Generate a statefulSet with the do-not-disrupt annotation set to true and a terminationGracePeriodSeconds with a wide window. Keep in mind that, although the process may capture SIGTERM, it should continue running without terminating immediately to simulate the described behavior.
- The nodeTerminationGracePeriod property of the NodePool should be equal to or greater than the terminationGracePeriodSeconds of the StatefulSet.
- Define expireAfter in the NodePool to ensure the NodeClaims hosting the pods from your StatefulSet expire.
Forceful method disabled:
- Generate a statefulSet with the do-not-disrupt annotation set to true.
- Do not define a value for the nodeTerminationGracePeriod property of the NodePool or explicitly set it to null.
- Define expireAfter in the NodePool to ensure the NodeClaims hosting the pods of your StatefulSet expire.

Versions:

Karpenter v1.0.7
Chart Karpenter Version: 1.0.7
Chart Karpenter CRDs Version: 1.0.7
EKS v1.29

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

The text was updated successfully, but these errors were encountered:

k8s-ci-robot · 2025-01-24T13:27:42Z

This issue is currently awaiting triage.

If Karpenter contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

jmdeal · 2025-02-01T01:47:45Z

I think there's some good points here, but I'd like to clarify a couple of things:

If defined as null, the expiration controller will wait indefinitely, respecting do-not-disrupt, PDBs and so on

This is not referring to the expiration controller, it is referring to the termination controller. Expiration will take place regardless of TGP or any other factor.

Forceful method disabled (expiration controller). The controller will respect workloads with PDBs and do-not-disrupt annotations on expired NodeClaims

This isn't the intended expiration behavior on v1+. The main purpose of expiration is to enforce maximum node lifetimes for compliance purposes, which isn't possible if it respects PDBs or do-not-disrupt annotations. We could consider reimplementing a "graceful" expiration method in the future, but most use cases we have heard are more appropriately addressed by drift. There's some more context in the design doc.

The change as you've suggested (IIUC) would have some unintended consequences for the remaining graceful disruption modes (drift and consolidation). At a high level, graceful disruption follows these steps:

Determine disruption candidate(s)
Provision replacement capacity
- If replacement capacity can't be created, bail
Set deletion timestamp on node and begin drain

If Karpenter does not continue to nominate nodes for pods on the disrupted nodes, those nodes may be consolidated during the drain process. At least when considering pods on nodes which were gracefully disrupted, this could end up increasing node churn.

That being said, I think there's a good argument for ignoring pods with the do-not-disrupt annotation. The intended use-case for the do-not-disrupt annotation is jobs, and ideally your TGP is large enough that it allows jobs to complete before the node is terminated. Pods being forcibly deleted should be the exception, not the norm. I don't think we want to ignore pods which are being blocked by PDBs though, since we can't predict when those PDBs will be unblocked. In a healthy cluster, PDBs should really just serve to rate limit disruption, rather than blocking indefinitely.

These are just my initial thoughts, I'd have to think through the ramifications for disruption some more, but I definitely think there's paths for improvement that we should explore.

jorgeperezc · 2025-02-04T11:09:51Z

Apologies for mixing up the concepts, you are right. I am trying to learn and find an explanation in the source code for the observed behavior.

The big problem I see is that if you do not set a TGP and you have workloads with the do-not-disrupt annotation, the termination controller will wait indefinitely. Meanwhile, if the cluster has not enough capacity, it will create another nodeClaim for the workload that cannot be drained. Therefore, we end up paying cloud providers for nothing.

https://github.com/kubernetes-sigs/karpenter/blob/v1.0.5/pkg/controllers/node/termination/terminator/terminator.go#L165

func (t *Terminator) DeleteExpiringPods(ctx context.Context, pods []*corev1.Pod, nodeGracePeriodTerminationTime *time.Time) error {
	for _, pod := range pods {
		// check if the node has an expiration time and the pod needs to be deleted
		deleteTime := t.podDeleteTimeWithGracePeriod(nodeGracePeriodTerminationTime, pod)
		if deleteTime != nil && ...
        }
...
return nil
}

* In this case, I see the deleteTime always will be nil, waiting for an eviction that will never happen.

I want to understand the possibility of not setting a TGP, as it seems to make no sense and try to align our config with the new behavior.

I am grateful for your time and help @jmdeal . What would be a use case for not setting a TGP, considering the waste of resources?

PS: Although migrating to the new version is being a bit painful, kudos to the team for the project!

Thanks.

jorgeperezc added the kind/bug Categorizes issue or PR as related to a bug. label Jan 24, 2025

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 24, 2025

jorgeperezc mentioned this issue Jan 24, 2025

Avoid overprovisioning capacity for workloads on expired nodes #1929

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Massive scaling and anomalous behaviors with the new forceful disruption method #1928

Massive scaling and anomalous behaviors with the new forceful disruption method #1928

jorgeperezc commented Jan 24, 2025 •

edited

Loading

k8s-ci-robot commented Jan 24, 2025

jmdeal commented Feb 1, 2025

jorgeperezc commented Feb 4, 2025

Massive scaling and anomalous behaviors with the new forceful disruption method #1928

Massive scaling and anomalous behaviors with the new forceful disruption method #1928

Comments

jorgeperezc commented Jan 24, 2025 • edited Loading

Description

k8s-ci-robot commented Jan 24, 2025

jmdeal commented Feb 1, 2025

jorgeperezc commented Feb 4, 2025

jorgeperezc commented Jan 24, 2025 •

edited

Loading