cluster-autoscaler: Try disabling anti affinity check #1246
base: main
Conversation
Left a few thoughts, general approach is ok, as discussed.
Please also prefix your PR title with cluster-autoscaler:
```diff
  var err error
  var remainingPods []*apiv1.Pod

+ klog.V(4).Infof("trying to schedule %d pods on existing nodes", len(podsEquivalenceGroup.Pods))
```
Let's remove the print debugging?
Honestly, I'd prefer to keep it in unless you strongly think we should take it out. If we run into issues with this again, it'd be a lot easier to debug with the messages already in place than to add them back and go through the back and forth that entails. Though, we default to v=4 I believe, so I'd be happy to give it a higher verbosity (5?).
In general, I'd like to minimize the size of our patch, to avoid conflicts when updating.
So in this case, if you think the debugging is useful, let's keep it in but revisit in 1-3 months to see if we can get rid of it?
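For reference, if the verbosity does get bumped, a minimal sketch of what that would look like, reusing the exact log line from the diff above at level 5 instead of 4:

```go
// Same message as in the patch, but at verbosity 5 so it only appears when
// explicitly requested (the default here is --v=4, per the discussion above).
klog.V(5).Infof("trying to schedule %d pods on existing nodes", len(podsEquivalenceGroup.Pods))
```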
Simplifies our setup at the expense of longer image build times. Without pushing cluster-autoscaler anywhere, it's hard to test changes from a PR. cc @mikhail-sakhnov re: #1138, @edude03 re: #1246 Co-authored-by: Michael Francis <[email protected]>
lgtm, pending final unresolved comment
```diff
  for _, eg := range podEquivalenceGroups {
  	samplePod := eg.Pods[0]
- 	if err := o.autoscalingContext.PredicateChecker.CheckPredicates(o.autoscalingContext.ClusterSnapshot, samplePod, nodeInfo.Node().Name); err == nil {
+ 	err := o.autoscalingContext.PredicateChecker.CheckPredicates(o.autoscalingContext.ClusterSnapshot, samplePod, nodeInfo.Node().Name)
```
Let's break this into multiple lines?
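For reference, one way the call from the diff could be split; this is purely a formatting suggestion, the call itself is unchanged:

```go
err := o.autoscalingContext.PredicateChecker.CheckPredicates(
	o.autoscalingContext.ClusterSnapshot,
	samplePod,
	nodeInfo.Node().Name,
)
```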
After changing the over-provisioning pod deployment to have a zonal anti-affinity, we noted that cluster-autoscaler fails to create a node for the unschedulable over-provisioning pod. This is a known issue (see below; the comment right above the change mentions this), but it is of course not the behaviour we want.
While this isn't a proper fix, it should "force" cluster-autoscaler to provision new nodes, giving us the extra capacity we hope to gain from the over-provisioning pods (onto which the over-provisioning pods will then, of course, be scheduled).
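For context, a rough sketch of the kind of zonal anti-affinity described above, expressed with the core/v1 Go types the autoscaler already imports. The `app: overprovisioning` label and the package/function names are assumptions for illustration, not the actual deployment config:

```go
package overprovisioning

import (
	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// zonalAntiAffinity spreads the over-provisioning pods across zones: no two
// pods matching the (assumed) "app: overprovisioning" label may be scheduled
// into the same topology.kubernetes.io/zone. A pod with no zone available to
// satisfy this becomes unschedulable, which is the scenario this PR targets.
func zonalAntiAffinity() *apiv1.Affinity {
	return &apiv1.Affinity{
		PodAntiAffinity: &apiv1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []apiv1.PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "overprovisioning"},
					},
					TopologyKey: "topology.kubernetes.io/zone",
				},
			},
		},
	}
}
```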