Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster-autoscaler: Try disabling anti affinity check #1246

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

edude03
Copy link
Contributor

@edude03 edude03 commented Feb 4, 2025

After changing the over-provisioning pod deployment to have a zonal anti affinity, we noted that cluster autoscaler fails to create a node for the un-schedulable over provisioning pod. This is a known issue (see below, the comment right about the change mentions this) however this is of course not the behaviour we want.

While this isn't a proper fix, it should "force" cluster-autoscaler to provision new nodes giving us the extra capacity we hope to gain from the over-provisioning pods (which will then of course scheduling the over-provisioning pods)

@sharnoff sharnoff self-assigned this Feb 7, 2025
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few thoughts, general approach is ok, as discussed.

Please also prefix your PR title with cluster-autoscaler:

var err error
var remainingPods []*apiv1.Pod

+ klog.V(4).Infof("trying to schedule %d pods on existing nodes", len(podsEquivalenceGroup.Pods))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove the print debugging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'd prefer to keep it in unless you strongly think we should take it out. If we run into issue with this again it'd be a lot easier to debug with the debugging messages than of course, adding them back and the back and forth that entails. Though - we default to v=4 I believe so I'd be happy to give it a higher verbosity (5?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I'd like to minimize the size of our patch, to avoid conflicts when updating.

So in this case, if you think the debugging is useful, let's keep it in but revisit in 1-3 months to see if we can get rid of it?

@edude03 edude03 changed the title Try disabling anti affinity check cluster-autoscaler: Try disabling anti affinity check Feb 10, 2025
edude03 added a commit that referenced this pull request Feb 10, 2025
Simplifies our setup at the expense of longer image build times. Without
pushing cluster-autoscaler anywhere, it's hard to test changes from a
PR.

cc @mikhail-sakhnov re: #1138, @edude03 re: #1246

Co-authored-by: Michael Francis <[email protected]>
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, pending final unresolved comment

for _, eg := range podEquivalenceGroups {
samplePod := eg.Pods[0]
- if err := o.autoscalingContext.PredicateChecker.CheckPredicates(o.autoscalingContext.ClusterSnapshot, samplePod, nodeInfo.Node().Name); err == nil {
+ err := o.autoscalingContext.PredicateChecker.CheckPredicates(o.autoscalingContext.ClusterSnapshot, samplePod, nodeInfo.Node().Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's break this into multiple lines?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants