docs(scheduler): scheduler-extensibility #7044
CharlesQQ wants to merge 1 commit into karmada-io:master
Conversation
Summary of Changes: Hello @CharlesQQ, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This PR presents a design proposal for enhancing the extensibility of the Karmada scheduler. By introducing a plugin mechanism and extension APIs, the proposal aims to address the current scheduler's limitations in meeting enterprise-grade customization requirements. The core idea is to let users flexibly define and implement complex scheduling strategies without modifying the scheduler's core code, thereby improving Karmada's adaptability and functionality in multi-cluster environments.
Code Review
This proposal presents a clear design for enhancing the extensibility of the Karmada scheduler. By introducing the `AdvancedScheduling` field and the new `ReplicaSchedulingPlugin` extension point, users can customize scheduling strategies more flexibly. The document is well structured and the user stories are rich.

I found a few areas for improvement:

- The "Background Problems" section is empty; consider filling it in to provide more complete context.
- The `PropagationPolicy` example in Story 5 is missing the `spec.replicas` field, and its scheduling result is inconsistent with the description, which may cause confusion.
- When migrating from annotations to the new API fields, the key naming convention changes; consider documenting this explicitly for clarity.

These changes will help improve the accuracy and readability of the proposal.
> - **During promotions (50 replicas)**:
>   - idc-self-cluster-1: 15 replicas
>   - idc-self-cluster-2: 15 replicas
>   - mixed-cloud-cluster: 50 replicas (automatically allocated to the mixed cluster; the member cluster controls this replica count)
> ## Motivation
>
> ### Background Problems
```yaml
# PropagationPolicy configuration
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: ecommerce-policy
  annotations:
    scheduler.karmada.io/replica-scheduling-strategy: |
      {
        "specifiedIdcs": [
          {"name": "idc-self", "replicas": 30}
        ]
      }
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: ecommerce-app
  placement:
    clusterAffinity:
      labelSelector:
        matchLabels:
          env: production
    replicaScheduling:
      replicaSchedulingType: Divided
```
> #### Example 1: Specify IDCs and replica counts (specified-idcs)
```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: core-service-policy
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: core-service
  placement:
    clusterAffinity:
      labelSelector:
        matchLabels:
          env: production
    advancedScheduling:
      specified-idcs:
        - name: "idc-east"
          replicas: 20
        - name: "idc-north"
          replicas: 10
```
> #### Example 2: Specify clusters and replica counts (specified-clusters)
```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: precise-allocation-policy
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: my-app
  advancedScheduling:
    specified-clusters:
      - name: "cluster-1"
        replicas: 15
      - name: "cluster-2"
        replicas: 10
      - name: "cluster-3"
        replicas: 5
```
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master    #7044      +/-   ##
==========================================
- Coverage   46.62%   41.98%    -4.64%
==========================================
  Files         699      874      +175
  Lines       48151    53542     +5391
==========================================
+ Hits        22450    22481       +31
- Misses      24013    29373     +5360
  Partials     1688     1688
```

Flags with carried forward coverage won't be shown.
I can see a lot of interesting use cases. Will give it another look later.
Force-pushed: b4a4108 → 3be7291
Force-pushed: 1d8dea9 → a35b3bb
> **Alternative Using Karmada SpreadConstraints**:
```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: microservice-policy-spread
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: microservice-app
  placement:
    clusterAffinity:
      labelSelector:
        matchLabels:
          env: production
    spreadConstraints:
      - spreadByLabel: topology.karmada.io/idc
        maxGroups: 2
        minGroups: 2
    replicaScheduling:
      replicaSchedulingType: Divided
      replicas: 30
```
> **Explanation**:
> - `spreadByLabel: topology.karmada.io/idc`: Distribute by IDC label dimension
> - `minGroups: 2`: Ensure deployment in at least 2 IDCs
> - `maxGroups: 2`: Limit deployment to at most 2 IDCs
> - Prerequisite: Clusters need the label `topology.karmada.io/idc`, for example:
>   - East China clusters: `topology.karmada.io/idc: idc-east`
>   - North China clusters: `topology.karmada.io/idc: idc-north`
> - The scheduler automatically distributes replicas between the two IDCs, with the specific allocation determined by cluster resource conditions
It seems the alternative solution works. @CharlesQQ Have you tried it? Anything blocks you from this approach?
Personally, I would prefer this solution, but with limited changes:

- Set the cluster region `.spec.Region` on all Clusters instead of using labels, for clarity, and use `spreadByField: region` accordingly. (optional)
- For the replica scheduling strategy, I'd prefer `dynamicWeight` for this use case.
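A minimal sketch of what this suggestion could look like, assuming member Clusters have `.spec.region` set (illustrative only, expressing the reviewer's idea as YAML — not part of the proposal):

```yaml
# Hypothetical sketch: spread across regions and let dynamic weighting
# assign replicas based on each cluster's available capacity.
placement:
  spreadConstraints:
    - spreadByField: region   # assumes .spec.region is set on member Clusters
      minGroups: 2
      maxGroups: 2
  replicaScheduling:
    replicaSchedulingType: Divided
    replicaDivisionPreference: Weighted
    weightPreference:
      dynamicWeight: AvailableReplicas
```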
From a dynamic scheduling perspective this is feasible, and it can be replaced at the API level. The difference, however, is that our scheduling strategy includes constraints based on allocation-rate thresholds, which the native Karmada scheduler likely doesn't support — or perhaps supports only via the estimator, which we haven't used.
Oh, I see. I'm interested in it.
Can you share some details regarding your allocation rate thresholds?
- How to set the thresholds for each cluster?
- How is the allocation rate of a cluster estimated? Based on total usage across all nodes? Based on what kind of resources? Like CPU/Mem?
- How is the threshold taken into account in the scheduling process?
Okay, let me share our current approach:

1. We've customized a scheduling filter plugin based on an allocation-rate threshold.
2. The cluster thresholds are configured in a separate etcd instance. If a user modifies them, the plugin dynamically detects the change and reloads the configuration in memory.
3. The allocation rate is calculated as follows: the `replicaRequirements.nodeClaim` is obtained from the `rb`/`crb`. The denominator is the total resource amount of the schedulable nodes selected by the `nodeClaim`; the numerator is the sum of the requests of pods with the same `nodeClaim`, i.e., `sum(pod.request(cpu/memory/gpu)) / sum(node.allocatable(cpu/memory/gpu))`.
4. If the current allocation rate of an `rb` on a scheduled cluster exceeds the threshold, the cluster is filtered out directly (if replicas already exist there, they remain unchanged, and the score plugin will not add new replicas to that cluster).
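The calculation and threshold check described above can be sketched in Go. This is a minimal illustration with hypothetical types and helper names (`Resources`, `allocationRatios`, `exceedsThreshold` are not Karmada APIs):

```go
package main

import "fmt"

// Resources is a hypothetical per-resource quantity map
// (e.g. cpu in millicores, memory in bytes).
type Resources map[string]int64

// allocationRatios computes, per resource,
// sum(pod requests) / sum(node allocatable) over the nodes
// that match the workload's nodeClaim.
func allocationRatios(podRequests, nodeAllocatable []Resources) map[string]float64 {
	sum := func(rs []Resources) Resources {
		total := Resources{}
		for _, r := range rs {
			for name, qty := range r {
				total[name] += qty
			}
		}
		return total
	}
	requested, allocatable := sum(podRequests), sum(nodeAllocatable)
	ratios := map[string]float64{}
	for name, alloc := range allocatable {
		if alloc > 0 {
			ratios[name] = float64(requested[name]) / float64(alloc)
		}
	}
	return ratios
}

// exceedsThreshold reports whether ANY single resource ratio crosses
// the threshold — this is how the filter described above rejects a cluster
// even when other resources still have headroom.
func exceedsThreshold(ratios map[string]float64, threshold float64) bool {
	for _, r := range ratios {
		if r > threshold {
			return true
		}
	}
	return false
}

func main() {
	pods := []Resources{{"cpu": 2000, "memory": 4096}, {"cpu": 1000, "memory": 2048}}
	nodes := []Resources{{"cpu": 4000, "memory": 16384}}
	ratios := allocationRatios(pods, nodes)
	// cpu ratio is 3000/4000 = 0.75, which exceeds a 0.7 threshold
	fmt.Println(exceedsThreshold(ratios, 0.7)) // true
}
```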
> `sum(pod.request(cpu/memory/gpu)) / sum(node.allocatable(cpu/memory/gpu))`
Does that mean the allocation ratio is calculated separately for each resource specified in the requests?
If any resource's allocation ratio exceeds the threshold, the filter plugin will reject the target cluster. For instance, if the CPU allocation ratio exceeds the threshold, the target cluster will still be rejected even if there is enough memory, right?
> We've customized a scheduling filter plugin based on an allocation rate threshold.
Is the allocation ratio calculated inside the plugin? It's pretty heavy as the plugin has to iterate through all nodes during each scheduling cycle.
> Is the allocation ratio calculated inside the plugin?
Yes
> It's pretty heavy as the plugin has to iterate through all nodes during each scheduling cycle.
The resource data for nodes and pods has already been loaded into memory via Informer. The computation does not involve I/O operations and is relatively fast (in the millisecond range), but it does consume some CPU.
I know — that means the scheduler would consume quite a lot of memory, as it needs to cache all Nodes and Pods from all clusters.
> I know — that means the scheduler would consume quite a lot of memory, as it needs to cache all Nodes and Pods from all clusters.
That's it
Thanks for the clarification. Opened #7137 for tracking the threshold thing, as it is obviously out of the scope of this proposal.
> - North China IDC: 10 replicas
> - Within each IDC, the scheduler automatically distributes based on cluster resources
>
> **Use Case Example**: A recommendation service depends on a third-party service that has 2x more instances in East China IDC than in North China IDC, while also requiring high availability across IDCs. For example, if the third-party service has 100 instances in East China and 50 in North China, the recommendation service needs to deploy in the same ratio (2:1) across both IDCs to ensure call efficiency and high availability while avoiding cross-IDC call latency.
Now I'm looking at the second user story, which is interesting: it requires specifying weight preferences by region instead of by cluster. This might be a good candidate feature that we could implement.
My questions here are:
- Is the third-party service also managed by Karmada?
- Is there any particular reason that it has to have more instances in East China IDC than in North?
> Is the third-party service also managed by Karmada?
Yes
> Is there any particular reason that it has to have more instances in East China IDC than in North?

Basically, it depends on the location of the third-party dependencies. For example, if the dependent MySQL master node is located in the East IDC, the business will prefer to allocate more instances in that IDC for faster read and write speeds. At the same time, to ensure high availability between IDCs, some instances will also be deployed to the North IDC.
> **Background**: My application requires high-availability deployment with balanced replica distribution across all clusters within specified IDCs.
>
> **Requirements**:
> - East China IDC: 20 replicas, balanced across 4 clusters
> - North China IDC: 10 replicas, balanced across 2 clusters
> - Minimize replica count differences between clusters
> balanced across 4 clusters
>
> balanced across 2 clusters

Here you don't limit the number of selected clusters; you just aim to distribute replicas as evenly as possible across the chosen clusters. Is that right?
My question here is how do you handle a case where some clusters don't have sufficient resources?
> My question here is how do you handle a case where some clusters don't have sufficient resources?
Yes, this strategy only applies to Kubernetes clusters deployed on the public cloud. Once a cluster's resources exceed the capacity limit, an alert will be triggered. After determining that it is a genuine capacity issue, the SRE will intervene manually to scale up the system.
RainbowMango left a comment:
Now I've gone through all the user stories. As we discussed earlier, my suggestion is that we extract some common requirements from these use cases and try to implement them directly in Karmada wherever possible. At the same time, we'll continue with the extension mechanism proposed in this proposal.
> ---
>
> #### Story 4: Specified Clusters with Replica Counts (Based on Strategy 4: SpecifiedClusters)
This story describes the scenario where people want to control replica assignments precisely; it is mainly used to migrate existing workloads to Karmada, but it can also be used in similar (pretty rare) cases.
If Karmada supports this scenario with the PropagationPolicy, where people can configure the assignments in it, like:
```yaml
spec:
  placement:
    replicas:
      - bj-prod-cluster: 10
      - sh-prod-cluster: 8
      - gz-dr-cluster: 5
```

It looks like a static placement declaration, but a question arises:
What do you expect from karmada-scheduler? Like, does it need to check the available resources on each member cluster?
> What do you expect from karmada-scheduler? Like, does it need to check the available resources on each member cluster?
Currently, this strategy is mainly used in service migration scenarios. The scheduling result is constrained by the filtering strategy; if available resources were also considered during filtering, it would have an impact.
In our scenario, the available-resource check is placed in the assign-replicas phase, not the filtering phase, so it will not...
> As we discussed earlier, my suggestion is that we extract some common requirements from these use cases and try to implement them directly in Karmada wherever possible.
Yes, I think a new issue should be created to track this matter.
> **Requirements**:
> - Normal operation with low traffic: Deploy only in self-built IDCs
> - During promotions: Auto-scale using CronHPA, scheduling scaled replicas to cloud vendor clusters (mixed-type clusters)
This scenario is somehow related to the Extended Cluster Affinities feature. @zhzhuang-zju @vie-serendipity.
But my question here is: is it acceptable to scale the replicas in the self-built IDCs in case there are enough resources? If not, why?
This cluster uses a time-sharing mechanism for online and offline services: when the online service is scaled down, the offline service is scaled up, so the cluster's resources are always fully utilized.
The offline services are used for offline batch computations.
> **Requirements**:
> - Only schedule to clusters with ARM nodes
> - Verify node labels before scheduling to avoid scheduling failures
Do you need to check the available resources during the scheduling process?
If you only need a filter plugin to filter clusters with specified nodes according to the NodeSelector configuration in workloads, it's relatively simple.
Yes, we've added a filtering plugin for NodeSelector; if it's not the SpecifiedClusters strategy, the available resources will be checked during the assign replicas phase.
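Such a NodeSelector check can be sketched as follows. This is a minimal illustration with hypothetical helper names — it is not Karmada's actual plugin interface, only the core matching logic a filter plugin like this would need:

```go
package main

import "fmt"

// nodeSelectorMatches reports whether a node's labels satisfy every
// key/value pair in the workload's nodeSelector.
func nodeSelectorMatches(nodeLabels, selector map[string]string) bool {
	for key, want := range selector {
		if nodeLabels[key] != want {
			return false
		}
	}
	return true
}

// clusterFits reports whether at least one node in the cluster matches
// the selector, mirroring the filter behavior described above: clusters
// with no matching node (e.g. no ARM nodes) are filtered out before
// replicas are assigned.
func clusterFits(clusterNodes []map[string]string, selector map[string]string) bool {
	for _, labels := range clusterNodes {
		if nodeSelectorMatches(labels, selector) {
			return true
		}
	}
	return false
}

func main() {
	armSelector := map[string]string{"kubernetes.io/arch": "arm64"}
	cluster := []map[string]string{
		{"kubernetes.io/arch": "amd64"},
		{"kubernetes.io/arch": "arm64"},
	}
	fmt.Println(clusterFits(cluster, armSelector)) // true
}
```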
Force-pushed: a35b3bb → 6df6a77
[APPROVALNOTIFIER] This PR is NOT APPROVED
Pull request overview
Adds a new design proposal document describing how to enhance Karmada scheduler extensibility via an AssignReplicasPlugin extension point and an advancedScheduling configuration field.
Changes:
- Introduces a new proposal doc under `docs/proposals/scheduler-extensibility/` outlining motivations, API ideas, and plugin flow.
- Provides multiple user stories and example configurations for IDC/cluster replica allocation strategies.
- Includes implementation-plan pseudocode and plugin examples intended to guide future implementation.
```yaml
---
title: Enhancing Karmada Scheduler Extensibility to Support Customized Requirements
authors:
  - "charesQQ"
```
Front-matter authors should follow the repository proposal template and other proposals by listing GitHub handles prefixed with @ (for example - "@username"). This currently uses a plain string, which is inconsistent with existing proposal metadata and makes attribution harder to standardize.
```diff
-  - "charesQQ"
+  - "@charesQQ"
```
```yaml
          env: production
    replicaScheduling:
      replicaSchedulingType: Divided
      replicas: 30
```
The examples use a top-level spec.replicas field in PropagationPolicy (and later text says total replicas come from PropagationPolicy.spec.replicas), but the current PropagationPolicy API does not define spec.replicas (replicas are tracked on ResourceBinding.spec.replicas / the workload). These YAML examples and the surrounding explanation should be updated to reflect the actual API shape to avoid readers applying invalid manifests.
```diff
-      replicas: 30
+      # Note: total replicas are defined on the Deployment spec (spec.replicas), not on this PropagationPolicy.
```
> 4. **totalReplicas (int32)**
>    - Total number of replicas to allocate
>    - From PropagationPolicy's `spec.replicas` field
totalReplicas is described as coming from PropagationPolicy.spec.replicas, but PropagationPolicy has no spec.replicas field in the current API. Consider describing totalReplicas as sourced from ResourceBinding.spec.replicas (or the workload’s replica field) to match how the scheduler actually receives replica information.
```diff
-   - From PropagationPolicy's `spec.replicas` field
+   - From ResourceBinding's `spec.replicas` field (which reflects the workload's replica field, e.g., `Deployment.spec.replicas`)
```
```go
// If user registered custom plugin, use custom plugin
// Otherwise use default plugin DefaultAssignReplicasPlugin
pluginName := getRegisteredAssignReplicasPlugin(registry)
plugin, err := registry[pluginName](nil, f)
```
The proposal’s plugin factory signature is inconsistent within the document: PluginFactory is defined as taking a single configuration runtime.Object, but here the registry factory is invoked with two arguments (registry[pluginName](nil, f)). Please align the factory type and all examples so readers can implement plugins against a single, coherent API.
```diff
- plugin, err := registry[pluginName](nil, f)
+ plugin, err := registry[pluginName](nil)
```
> **Q3: How do plugins read configuration?**
>
> A: Plugins read configuration from the `binding.Spec.AdvancedScheduling` map, where the key is the strategy name and the value is a JSON configuration.
>
> **Q4: Will native assignReplicas logic be preserved?**
FAQ numbering is duplicated: this section introduces a second Q3 ("How do plugins read configuration?") after an existing Q3. Renumber this question and the following ones to keep the FAQ unambiguous.
> # Enhancing Karmada Scheduler Extensibility to Support Customized Requirements
>
> ## Summary
>
> While the current Karmada scheduler provides powerful multi-cluster scheduling capabilities, there are some limitations in meeting customized scheduling requirements when adopting Karmada in enterprise environments. This proposal aims to enhance the extensibility architecture of the Karmada scheduler to better support:
PR description still contains unfilled template placeholders (e.g., Fixes # without an issue number and no /kind ... selection). Please either link the actual issue(s) being fixed or remove the placeholder text so release automation and cross-references work as intended.
Signed-off-by: chang.qiangqiang <chang.qiangqiang@immomo.com>
Force-pushed: 6df6a77 → 1e4ceb0
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: