feat: dynamic ratelimiter for gracefuleviction #6675
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,149 @@ | ||
/* | ||
Copyright 2025 The Karmada Authors. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
||
package cluster | ||
|
||
import ( | ||
"time" | ||
|
||
"k8s.io/apimachinery/pkg/labels" | ||
"k8s.io/apimachinery/pkg/runtime/schema" | ||
"k8s.io/client-go/util/workqueue" | ||
"k8s.io/klog/v2" | ||
|
||
clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1" | ||
"github.com/karmada-io/karmada/pkg/metrics" | ||
"github.com/karmada-io/karmada/pkg/sharedcli/ratelimiterflag" | ||
"github.com/karmada-io/karmada/pkg/util" | ||
"github.com/karmada-io/karmada/pkg/util/fedinformer/genericmanager" | ||
) | ||
|
||
// maxEvictionDelay is the maximum delay for eviction when the rate is 0 | ||
const maxEvictionDelay = 1800 * time.Second | ||
|
||
// DynamicRateLimiter adjusts its rate based on the overall health of clusters. | ||
// It implements the workqueue.RateLimiter interface with dynamic behavior. | ||
type DynamicRateLimiter[T comparable] struct { | ||
resourceEvictionRate float32 | ||
secondaryResourceEvictionRate float32 | ||
unhealthyClusterThreshold float32 | ||
largeClusterNumThreshold int | ||
informerManager genericmanager.SingleClusterInformerManager | ||
} | ||
|
||
// NewDynamicRateLimiter creates a new DynamicRateLimiter with the given options. | ||
func NewDynamicRateLimiter[T comparable](informerManager genericmanager.SingleClusterInformerManager, opts EvictionQueueOptions) workqueue.TypedRateLimiter[T] { | ||
return &DynamicRateLimiter[T]{ | ||
resourceEvictionRate: opts.ResourceEvictionRate, | ||
secondaryResourceEvictionRate: opts.SecondaryResourceEvictionRate, | ||
unhealthyClusterThreshold: opts.UnhealthyClusterThreshold, | ||
largeClusterNumThreshold: opts.LargeClusterNumThreshold, | ||
informerManager: informerManager, | ||
} | ||
} | ||
|
||
// When determines how long to wait before processing an item. | ||
// Returns a longer delay when the system is unhealthy. | ||
func (d *DynamicRateLimiter[T]) When(_ T) time.Duration { | ||
currentRate := d.getCurrentRate() | ||
klog.V(4).Infof("⏱️ DynamicRateLimiter: Current rate: %.2f/s", currentRate) | ||
Review comment: Can this clock icon be displayed properly on all systems? If you're unsure, should it be removed first? |
||
if currentRate == 0 { | ||
return maxEvictionDelay | ||
} | ||
return time.Duration(1 / currentRate * float32(time.Second)) | ||
} | ||
|
||
// getCurrentRate calculates the appropriate rate based on cluster health: | ||
// - Normal rate when system is healthy | ||
// - Secondary rate when system is unhealthy but large-scale | ||
// - Zero (halt evictions) when system is unhealthy and small-scale | ||
func (d *DynamicRateLimiter[T]) getCurrentRate() float32 { | ||
Review comment: Can we add test code for these new additions? Reply: Sure, I will upload the test code to this PR. Review comment: This function contains the main computational logic, so a unit test can be added. |
||
clusterGVR := schema.GroupVersionResource{ | ||
Group: clusterv1alpha1.GroupName, | ||
Version: "v1alpha1", | ||
Resource: "clusters", | ||
} | ||
|
||
var lister = d.informerManager.Lister(clusterGVR) | ||
if lister == nil { | ||
klog.Errorf("Failed to get cluster lister, halting eviction for safety") | ||
return 0 | ||
} | ||
|
||
clusters, err := lister.List(labels.Everything()) | ||
if err != nil { | ||
klog.Errorf("Failed to list clusters from informer cache: %v, halting eviction for safety", err) | ||
return 0 | ||
} | ||
|
||
totalClusters := len(clusters) | ||
if totalClusters == 0 { | ||
return d.resourceEvictionRate | ||
} | ||
|
||
unhealthyClusters := 0 | ||
for _, clusterObj := range clusters { | ||
cluster, ok := clusterObj.(*clusterv1alpha1.Cluster) | ||
if !ok { | ||
continue | ||
} | ||
if !util.IsClusterReady(&cluster.Status) { | ||
unhealthyClusters++ | ||
} | ||
Review comment on lines +103 to +105: Directly determining whether a cluster is ready seems unreliable. Should we instead check whether the cluster has NoExecute taints? |
||
} | ||
|
||
// Update metrics | ||
failureRate := float32(unhealthyClusters) / float32(totalClusters) | ||
metrics.RecordClusterHealthMetrics(unhealthyClusters, float64(failureRate)) | ||
Review comment: Has the logic related to metrics been split into PR #6778? |
||
|
||
// Determine rate based on health status | ||
isUnhealthy := failureRate > d.unhealthyClusterThreshold | ||
if !isUnhealthy { | ||
return d.resourceEvictionRate | ||
} | ||
|
||
isLargeScale := totalClusters > d.largeClusterNumThreshold | ||
if isLargeScale { | ||
klog.V(2).Infof("System is unhealthy (failure rate: %.2f), downgrading eviction rate to secondary rate: %.2f/s", | ||
failureRate, d.secondaryResourceEvictionRate) | ||
return d.secondaryResourceEvictionRate | ||
} | ||
|
||
klog.V(2).Infof("System is unhealthy (failure rate: %.2f) and instance is small, halting eviction.", failureRate) | ||
return 0 | ||
} | ||
|
||
// Forget is a no-op as this rate limiter doesn't track individual items. | ||
func (d *DynamicRateLimiter[T]) Forget(_ T) { | ||
// No-op | ||
} | ||
|
||
// NumRequeues always returns 0 as this rate limiter doesn't track retries. | ||
func (d *DynamicRateLimiter[T]) NumRequeues(_ T) int { | ||
return 0 | ||
} | ||
|
||
|
||
// NewGracefulEvictionRateLimiter creates a combined rate limiter for eviction. | ||
// It uses the maximum delay from both dynamic and default rate limiters to ensure | ||
// both cluster health and retry backoff are considered. | ||
func NewGracefulEvictionRateLimiter[T comparable]( | ||
informerManager genericmanager.SingleClusterInformerManager, | ||
evictionOpts EvictionQueueOptions, | ||
rateLimiterOpts ratelimiterflag.Options) workqueue.TypedRateLimiter[T] { | ||
dynamicLimiter := NewDynamicRateLimiter[T](informerManager, evictionOpts) | ||
defaultLimiter := ratelimiterflag.DefaultControllerRateLimiter[T](rateLimiterOpts) | ||
return workqueue.NewTypedMaxOfRateLimiter[T](dynamicLimiter, defaultLimiter) | ||
|
||
} |
Review comment: This variable refers to the eviction of one resource every 1000 seconds, right?
Reply: maxEvictionDelay := 1000 * time.Second specifies the maximum wait time the queue imposes on an element when the calculated currentRate == 0 (i.e., eviction should be paused). It does not set a rate of "one eviction every 1000 seconds"; it uses a long delay to pause processing and wakes up after 1000 seconds to re-evaluate the rate. Normally, the processing interval is time.Duration(1/currentRate * time.Second). For example, if currentRate = 5/s, the queue processes one element every 200ms.