Skip to content

feature(scheduler): Customizable pod selection and ordering in DefaultPreemption plugin #129983

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

nickbp
Copy link

@nickbp nickbp commented Feb 4, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

The current plugin behavior is to select only based on pod priority, where any pod with higher priority can preempt any pod with lower priority. If this isn't desirable, there are two options: Reimplement this plugin from scratch, or update other users of pod priority (like scheduler queue ordering) to use something else. This introduces a way to customize the selection and ordering without needing to resort to that.

In our case, we have multiple priority classes but where only a subset of those are considered preemptible. Here's roughly how this looks:

  • High priority: higher in the scheduling queue, not preemptible
  • Low priority: lower in the scheduling queue, not preemptible
  • Preemptible priority: lowest in the scheduling queue, preemptible

This PR allows the preemption selection to be configured against the DefaultPreemption plugin, rather than needing to reimplement the plugin itself. The structure used here mimics what's currently being done for the Evaluator, where a PreemptPod callback may be overridden to customize what happens when performing a preemption.

This PR also updates the plugin's tests to call the updated New() function rather than constructing DefaultPreemption directly, so that New() is now being exercised in tests.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 4, 2025
@k8s-ci-robot
Copy link
Contributor

Welcome @nickbp!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 4, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @nickbp. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 4, 2025
@haosdent
Copy link
Member

haosdent commented Feb 5, 2025

High priority: higher in the scheduling queue, not preemptible
Low priority: lower in the scheduling queue, not preemptible

What if you set them as the same priority, and then they could not be preempted each other?

@nickbp
Copy link
Author

nickbp commented Feb 5, 2025

High priority: higher in the scheduling queue, not preemptible
Low priority: lower in the scheduling queue, not preemptible

What if you set them as the same priority, and then they could not be preempted each other?

The pod priority is also used by the PrioritySort plugin for the scheduling queue, so flattening the priority structure would lose that ordering as well.

@AxeZhan
Copy link
Member

AxeZhan commented Feb 5, 2025

/cc @kubernetes/sig-scheduling-misc

Changes like this will need a KEP instead of a single PR, but we can discuss based on this pr currently.

I think the starting point of this PR is very good. We have been struggling to keep some pods from being preempted.

My thought:
IIRC, we have discussed many times before whether preemption should strictly respect PDB.
Perhaps a configuration that make preemption strictly respects PDB and some PDBs protecting the pod would also solve your problem, while having better generalizability?

@k8s-ci-robot
Copy link
Contributor

@AxeZhan: GitHub didn't allow me to request PR reviews from the following users: kubernetes/sig-scheduling-misc.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @kubernetes/sig-scheduling-misc

Changes like this will need a KEP instead of a single PR, but we can discuss based on this pr currently.

I think the starting point of this PR is very good. We have been struggling to keep some pods from being preempted.

My thought:
IIRC, we have discussed many times before whether preemption should strictly respect PDB.
Perhaps a configuration that make preemption strictly respects PDB and some PDBs protecting the pod would also solve your problem, while having better generalizability?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sanposhiho
Copy link
Member

Looks like the same motivation as the preemptiontoleration plugin at our sub project, doesn't it?
https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg%2Fpreemptiontoleration%2FREADME.md

Although I, either way, agree to put this change since we can simplify the preemptiontoleration plugin implementation.

But, I'm not sure if it requires KEP because it's no API addition, a simple change, and no impact on the default scheduler's behavior.

@googs1025
Copy link
Member

In addition to pdb, PriorityClass can also be used to control which Pods can be preempted.

by the way: we have another issue to track: kubernetes-sigs/scheduler-plugins#839

@sanposhiho
Copy link
Member

sanposhiho commented Feb 5, 2025

Make an API (at PDB? PriorityClass?) for this kind of behavior is another story, which we can discuss separately. We need to make sure if there is enough demand for such behavior generally before proceeding for that, especially because we already have subproject and people can still just use that now. If there's only a small demand, then probably it's enough to keep it as a subproject and improve the features there.

For now, I believe we can just go ahead with this PR, that is, without KEP, and we can create a separate issue to see the reaction from users.

@haosdent
Copy link
Member

haosdent commented Feb 5, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 5, 2025
t.Fatal(err)
}
p, ok := pi.(*DefaultPreemption)
if !ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a test case to verify that customized EligiblePods and OrderedPods (different from the default one) work as expected?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's a good point, I'll write up some tests that exercise customizing these.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a set of test cases overriding EligiblePods and OrderPods with various conceivable versions of selection logic


// EligiblePods returns victim pods which are allowed to be preempted by the provided preemptor.
// The default behavior is to allow any pods of lower priority to be preempted by any pods of higher priority.
EligiblePods func(nodeInfo *framework.NodeInfo, preemptor *v1.Pod) []*framework.PodInfo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, from what I've seen so far, this PR doesn't seem to introduce any changes compared to the original logic. If that's the case, I'm curious how does it provide the 'Customizable' feature? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass the custom functions when creating DefaultPreemption{}?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the idea is that someone would be able to assign their own functions when constructing the plugin. This should be easier to see once I've written some unit tests to exercise it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this has limitations, because we can only use this when working directly with DefaultPreemption{}. It cannot be used after calling New(...) since New returns an interface. 🤔

Copy link
Member

@AxeZhan AxeZhan Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not customizable for me.
My thought is there should be some configuration for users to set, like NodeResourcesFitArgs for example.
And that's why I said maybe we need a KEP at first🤔.
@sanposhiho @macsko What do you think? Can we create a new configuration without a KEP?

Copy link
Member

@sanposhiho sanposhiho Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It cannot be used after calling New(...) since New returns an interface

You can still do type assertions. But, I'm open to a discussion how to make it extensible in a better way.

Given, like I said, we already have Evaluator.Interface for a similar purpose, the alternative option would be having other funcs there, like:

type Interface interface {
...

// SelectVictimsOnNode finds...
//...
// Evaluator passes Interface.EligiblePodsFunc to eligiblePodsFunc.
  SelectVictimsOnNode(ctx context.Context, state *framework.CycleState,
		pod *v1.Pod, nodeInfo *framework.NodeInfo, pdbs []*policy.PodDisruptionBudget, 
		eligiblePodsFunc(nodeInfo *framework.NodeInfo, preemptor *v1.Pod) []*framework.PodInfo) ([]*v1.Pod, int, *framework.Status)
  EligiblePodsFunc(nodeInfo *framework.NodeInfo, preemptor *v1.Pod) []*framework.PodInfo
}

or maybe we can just put EligiblePodsFunc and OrderPods to Evaluator struct and add params to NewEvaluator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not customization for general users who want to use a default scheduler though, this is also customization just for a different layer of users.

In that case, I tend to we extend Evaluator.Interface here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a sort of workaround here, I just recently added a call to construct a DefaultPreemption directly, so now there's:

  • New(): Returns framework.Plugin, aligns with PluginFactoryWithFts (no change)
  • NewDefaultPreemption(): Same arguments as New(), but returns DefaultPreemption without needing a type assertion.

How does this look?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe we can just put EligiblePodsFunc and OrderPods to Evaluator struct and add params to NewEvaluator.

If a new interface method is added, it may also be necessary to consider the possibility of destroying the original interface. 🤔 If the original logic is not changed, perhaps changing the interface will do more harm than good. This is just my opinion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may also be necessary to consider the possibility of destroying

Yeah, good point. So, we should avoid adding new funcs there so that we don't break user's implementation unnecessarily.

As a sort of workaround here, I just recently added a call to construct a DefaultPreemption directly,

I missed this change. The current one looks the best for me over the other ones that I mentioned above. This way makes it easier for users to replace the existing DefaultPreemption.Evaluator as well.

@@ -206,11 +227,11 @@ func (pl *DefaultPreemption) SelectVictimsOnNode(
if status := pl.fh.RunFilterPluginsWithNominatedPods(ctx, state, pod, nodeInfo); !status.IsSuccess() {
return nil, 0, status
}
var victims []*v1.Pod
var victims []*framework.PodInfo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep the current implementation []*v1.Pod ?

Copy link
Author

@nickbp nickbp Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this thinking:

  • Further down here, the full PodInfo is being used for removePod() which then calls fh.RunPrefilterExtensionRemovePod(), so mapping to Pods to be sorted, and then mapping the sorted result back to the original PodInfos would be tricky.
  • The provided OrderedPods function could have a use for other information in the PodInfo anyway


// OrderedPods sorts eligible victims in-place in descending order of highest to lowest priority.
// Pods at the start of the slice are less likely to be preempted.
OrderedPods func(eligible []*framework.PodInfo)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will OrderPods be better since this returns a function instead of a sorted pod slice

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that sounds better, I wasn't super sure about the naming myself. Renamed here and throughout

Copy link
Member

@macsko macsko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach looks good

var _ frameworkruntime.PluginFactoryWithFts = New

// NewDefaultPreemption initializes a new plugin and returns it. The plugin type is retained to allow modification.
func NewDefaultPreemption(_ context.Context, dpArgs runtime.Object, fh framework.Handle, fts feature.Features) (*DefaultPreemption, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This func doesn't belong to the interface, so context arg can be removed if unused

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, done

return NewDefaultPreemption(ctx, dpArgs, fh, fts)
}

var _ frameworkruntime.PluginFactoryWithFts = New
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is redundant. Code won't compile if the New doesn't match frameworkruntime.PluginFactoryWithFts

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, I'd added this as a "just in case" but the build would indeed fail elsewhere. Removed.

for _, p := range s.Victims().Pods {
gotPodNames = append(gotPodNames, p.Name)
}
if !reflect.DeepEqual(expectPodNames, gotPodNames) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use cmp.Diff to compare

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +1794 to +1884
var pods []runtime.Object
pods = append(pods, test.pod)
for _, pod := range test.pods {
pods = append(pods, pod)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var pods []runtime.Object
pods = append(pods, test.pod)
for _, pod := range test.pods {
pods = append(pods, pod)
}
pods = append([]runtime.Object{test.pod}, test.pods...)

Copy link
Author

@nickbp nickbp Feb 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting a type error on this change:

default_preemption_test.go:1794:46: cannot use test.pods (variable of type []*"k8s.io/api/core/v1".Pod) as []"k8s.io/apimachinery/pkg/runtime".Object value in argument to append

Tried changing test.pods itself to be a []runtime.Object but the []*v1.Pod typing is needed further down when it's passed to internalcache.NewSnapshot().

I'll leave it as-is for now but LMK if there's something else to try here.

@nickbp
Copy link
Author

nickbp commented Feb 11, 2025

At this point I'm not aware of anything else to be done here, let me know if I'm missing anything.

@nickbp
Copy link
Author

nickbp commented May 12, 2025

Just did a rebase to latest upstream, let's see how it goes 🤞

@nickbp
Copy link
Author

nickbp commented May 13, 2025

Alrighty, CI is looking good again, thanks for the tip

@dom4ha
Copy link
Member

dom4ha commented May 13, 2025

/approve
/hold

@sanposhiho @macsko please have a look as well, especially to the change in runtime/registry.go to make the factory return the type. I don't see a problem with this change, but may be missing something.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 13, 2025
@macsko
Copy link
Member

macsko commented May 14, 2025

/approve

@@ -31,11 +31,11 @@ import (
type PluginFactory = func(ctx context.Context, configuration runtime.Object, f framework.Handle) (framework.Plugin, error)
Copy link
Member

@sanposhiho sanposhiho May 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using generics looks like a good idea. But, can we modify this as well?, which I know isn't needed for this PR's change, but it's mainly for the convenience for other plugins.

Copy link
Author

@nickbp nickbp May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave this a few attempts just now and remembered that I think I'd tried this originally. Switching this to generics ends up causing typing problems around the Registry (aka map[string]PluginFactory) where we'd want everything to be matching types (returning framework.Plugin) anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I guess there is a way, but no need to do in this PR anyways. I'll follow up in another PR.

Copy link
Member

@sanposhiho sanposhiho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 17, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: e688119535c58f29430e9c9839a58ceca7c6e8f2

@sanposhiho
Copy link
Member

/retest

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dom4ha, macsko, nickbp, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sanposhiho
Copy link
Member

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants