Add KEP for DRA: Extended Resource #5136
Conversation
/assign @johnbelamaric
Awesome, thanks @yliaog
/cc
Force-pushed from 7ccd621 to a1d3c16
The ResourceClaim created in the scheduler as proposed in this KEP is solely for recording the allocation results, so that the kubelet can reference it during actuation. Currently a Pod can reference a ResourceClaim template, and the ResourceClaim controller then creates that ResourceClaim. What is proposed in this KEP is similar to that flow: the Pod has extended resource requests in its spec, and a controller (in this case, the scheduler) creates the ResourceClaim. The reason for using the scheduler (instead of the ResourceClaim controller) to create the claim is that only the scheduler has all the information needed to create the claim's requests, specifically at the dynamicresources plugin's Filter phase.

mem and cpu can be modeled the same way as proposed in this KEP for extended resources, as they are all of the same type (string -> an integer, e.g. example.com/gpu: 1, cpu: 1, mem: 1G).

A webhook is not a good choice because
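To make the proposed flow concrete, here is a minimal sketch of what a scheduler-created ResourceClaim might look like, assuming a pod that requests example.com/gpu: 1. The object name, owner reference, and DeviceClass name are illustrative assumptions, not the KEP's final API; the general request/allocation shape follows the resource.k8s.io API.

```yaml
# Hypothetical sketch of a scheduler-created ResourceClaim whose sole
# purpose is to record the allocation result for the kubelet to consume.
# Names are illustrative assumptions, not the KEP's final API.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: my-pod-extended-resources  # assumed naming scheme
  ownerReferences:                 # owned by the pod, for garbage collection
  - apiVersion: v1
    kind: Pod
    name: my-pod
    uid: 11111111-2222-3333-4444-555555555555
spec:
  devices:
    requests:
    - name: gpu                    # generated from the example.com/gpu request
      deviceClassName: gpu.example.com
status:
  allocation:
    devices:
      results:
      - request: gpu
        driver: gpu.example.com
        pool: node-1
        device: gpu-0
```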
I understand that, but still, that does not sound right. ResourceClaims define the scheduling intent and hold the result of that intent. I think we should still be able to define the intent that says: "Allocate an extended resource backed by a DRA plugin unless the one backed by a device plugin was allocated". No exceptions would be needed other than optionally not allocating such a claim if the extended resource was allocated. So both the DRA and extended resource scheduler plugins need to be aware of each other, which sounds reasonable considering the feature integrates both concepts.
The ResourceClaim controller seems more suitable for creating such a ResourceClaim then, as it's responsible for preparing the "scheduling intent" based on the pod's definition. Creating objects in the scheduler may unnecessarily complicate many things that are currently hard to predict, so it sounds like asking for trouble.
The intent is given by the extended resource in spec.resources.requests (e.g. example.com/gpu: 1); there is no need to create a claim to express such intent. The scheduler can act on that intent (example.com/gpu: 1) to make the allocation decision: it could be the noderesources plugin that satisfies the intent (in that case, the resources are provided by node.status.capacity), or it could be the dynamicresources plugin that satisfies it (in that case, the resources are provided by DRA ResourceSlices). In short, the intent is clearly specified in spec.resources.requests; a ResourceClaim is not necessary for the purpose of specifying the intent. Instead, it is created for the purpose of recording the allocation result. There is no extended resource plugin; extended resources are handled by the noderesources plugin, similar to cpu/mem resources.
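For reference, the intent being discussed is nothing more than a standard extended resource request in the pod spec. A minimal example (example.com/gpu is a stand-in resource name; note that extended resources must be specified with requests equal to limits):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: app
    image: registry.example/app:latest
    resources:
      requests:
        example.com/gpu: "1"  # may be satisfied by a device-plugin node or,
                              # per this KEP, by a DRA-managed device
      limits:
        example.com/gpu: "1"  # extended resources require requests == limits
```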
As mentioned above, there is no need for preparing the "scheduling intent", as the intent is already well specified by pod.spec.resources.requests (e.g. example.com/gpu: 1). This is also not the first time the scheduler creates objects: today the scheduler creates a Binding. Creating this ResourceClaim won't create another 'scheduling intent', as it is not associated with any pod.spec. Hence there is no circular dependency (i.e. there is no case where the scheduler creates the claim and then depends on that claim to act). It is logically very clear that the scheduler creates the claim, and the kubelet consumes this claim for actuation.
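For context, the Binding mentioned above is what the scheduler POSTs to the pod's binding subresource; as noted later in this thread, it is not persisted as a standalone object:

```yaml
# Sent to /api/v1/namespaces/<ns>/pods/my-pod/binding; this effectively
# sets the pod's spec.nodeName. No Binding object is stored in etcd.
apiVersion: v1
kind: Binding
metadata:
  name: my-pod
target:
  apiVersion: v1
  kind: Node
  name: node-1
```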
I'm not aware of any object that the scheduler creates. The way scheduling is documented to users is that it's a process of assigning Pods to Nodes and ResourceClaims to ResourceSlices, which requires updating existing objects, but not creating them. A decision was taken that the DRA allocation is performed by updating the status of ResourceClaim objects, because it was assumed that such an object must always exist. The alternative was to have separate objects dedicated to holding allocations, but the scheduler would have to create them, so there would be a problem of their garbage collection etc., which is exactly what is proposed here. Note also that there are external schedulers (e.g. Kueue or autoscaling) that may start using the resource nomination concept to instruct the scheduler how to schedule (bind) pods. This means that they would have to create the missing ResourceClaim when needed, and the scheduler would have to garbage collect it when it changes its decision. Since we have two design options and one of them is aligned with the decision that allocation is part of a preexisting ResourceClaim object, I don't see a reason why we'd choose the different option here.
Why? Either they use the extended resources API, then they don't create ResourceClaims, or they use ResourceClaims, then the scheduler needs to honor that decision and doesn't need to garbage collect.
Because what you are proposing fails to satisfy one important motivation for this KEP, "Enable application developers and operators to transition to DRA gradually at their own pace." The intended usage is that admins convert nodes from device plugins to DRA gradually, instead of having to take down the entire cluster, convert to DRA, then start scheduling workloads again. If the "ResourceClaim for extended resources" gets created in advance, the pod is locked to being scheduled to nodes which use DRA. The ResourceClaim controller would need to be aware of resource utilization (available resources, running pods) to make a smart decision upfront and then react to scheduling failures by revising that decision. That sounds very complex to me and something that is better handled during scheduling itself.
That's exactly the problem when creating the ResourceClaim in the kube-controller-manager, because there the controller has to make predictions. In the scheduler it's not a prediction, it's based on the analysis of the current state of the cluster at the time of scheduling. Or did you mean "predict future changes around the DRA design"? I'm not worried about that, the design of this KEP seems consistent to me. The ResourceClaim has two purposes, user intent and communicating allocations to the kubelet and other components which need to track resource usage (including the scheduler itself). This KEP only uses the second half, but that seems fine. It is normal in Kubernetes that API objects are created automatically to enact some other user-facing API. I understand that you are worried about the complexity that this adds to the scheduler, but IMHO that's still the best solution. The complexity doesn't go away by moving it somewhere else... |
It still satisfies that motivation; there are two equivalent approaches we should be choosing from:
The second option requires adding and removing the object whenever the scheduling decision changes. By scheduling decision I mean reserving resources, which is reflected in the api-server by placing ReservedFor or NominatedNodeName in the allocatable object. Resource nomination concepts are not used extensively yet, but will be used more and more, so operating on a preexisting object (even if it's sometimes not used because built-in resources satisfy it) would be much simpler.
Jumping late to the discussion - I tend to agree with @dom4ha, but really only the last comment explains the real motivation. I think that GC and overall lifecycle is not a compelling reason - as @pohly wrote, moving the logic from one place to the other doesn't necessarily reduce the complexity. (And it should be possible to use ownerRef to make the lifecycle management not very hard.) But the nomination concept can be used to communicate between different schedulers, and if changing decisions requires creation/deletion of additional objects, we may in fact be going back-and-forth as decisions change - this doesn't sound very compelling (xref: #5287).

I think the pattern that Dominik proposed might not have been clear from the beginning. IIUC, what he is proposing is that there is no need to predict, at ResourceClaim creation time, whether a node with a DRA driver or with a device plugin will be chosen. Instead, the proposal is to introduce an additional bit of information in the ResourceClaim - let's temporarily call it

I agree this approach is not perfect either - but I would like to understand better what the drawbacks are if we don't want to proceed with it.
My main concern is that it depends on extending the ResourceClaim API in non-trivial ways. This new API will be visible to the user, which then raises the question of how they should or shouldn't be allowed to use it. My other concern is that the controller has two choices:
Either way, the scheduler still has to check for "do I need a ResourceClaim for this extended resource" and potentially wait; otherwise scheduling races with the ResourceClaim creation. This seems like a lot of additional effort and complexity to avoid a Create call in the scheduler in a place where it currently already does an UpdateStatus. This simply doesn't seem worth it to me.
Both options can be combined. The controller creates a ResourceClaim with
@dom4ha: that is my second option ("Create them conditionally based on DeviceClass settings...").
"Just" downplays the complexity involved in this check. How does the scheduler know based on which DeviceClasses the ResourceClaim was created? How can it be sure that the DeviceClass(es) haven't been replaced or modified since then, if that influences the existence or content of the ResourceClaim? All that the scheduler gets out of this is that it doesn't need to create the ResourceClaim, which might not even be needed.
Isn't the DeviceClass name a part of such a special ResourceClaim? Since it is, the DeviceClass's current state should determine which extended resource (or other built-in resource) allocation can alternatively satisfy the claim (leaving it unassigned). It's expected that the extended resource or built-in resource is really specified by the pod, but if there's any mismatch, such a ResourceClaim can be ignored and only the extended resource from the PodSpec can be allocated.
Yes, these two approaches should be more or less equivalent, but IMO they make a difference for scheduling. |
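To make the alternative concrete, a pre-created claim along these lines might carry a marker tying it to the extended resource it mirrors, so the scheduler can leave it unallocated when the device-plugin path is chosen. The marker field below is invented purely for illustration; no such field exists in the API today:

```yaml
# Hypothetical sketch of the alternative: a controller-created claim that
# the scheduler may leave unallocated if the extended resource is satisfied
# by a device-plugin node. The extendedResourceName field is invented here
# for illustration only.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: my-pod-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com
  # Invented marker: this claim mirrors example.com/gpu from the pod spec
  # and may alternatively be satisfied by node capacity.
  extendedResourceName: example.com/gpu
```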
It's hard to come up with a perfect solution if we need to mix two different concepts together. One of them will always be counterintuitive:
As the KEP says, the solution may stay with us for longer, so we cannot expect to be able to clean it up soon. I would also like to understand the drawbacks of the two options before we take a decision.
Not as described in this KEP. The KEP describes requests, but not how they relate to DeviceClasses, because it doesn't matter. This mapping, and perhaps other fields like the UID and Generation of the referenced DeviceClasses, would have to be added. This brings me back to "This new API will be visible to the user, which then raises the question how they should or shouldn't be allowed to use it.".
It depends on how you define "not visible to the user". It seems to me that both are not visible in the sense that the user does not need to do anything to start using DRA, but
I think that the second option is more verbose to those users who try to debug what happened during scheduling. |
Not quite. The special ResourceClaim in this KEP is a fully-formed, valid ResourceClaim including a spec. That aside, my concern is that if the new fields get added to the ResourceClaimSpec, users may be tempted to set them when using DRA "normally". Do you envision them in the spec or in the status?
With the current proposal, they get that from the pod status. I don't see how this "optionally allocated ResourceClaim" improves upon that.
As discussed earlier, this KEP does not need to use the claim to specify intent, as the intent is already specified by pod.spec.resources.requests (e.g. example.com/gpu: 1). The claim is created to hold the allocation results, which are then consumed by the kubelet. In the proposed alternative (IIUC), the claim is created to specify the intent; in that case there are two intents, one specified in the ResourceClaim and the other given in pod.spec.resources.requests, and the scheduler has to somehow understand that these two intents actually mean the same thing, which is not needed with the proposal in this KEP.

The other key difference between the two approaches is when the claim is created. This KEP proposes just-in-time creation when it is absolutely needed, no more, no less. The alternative proposes creating the claim based on some static analysis of the cluster state (pod, device class, maybe node also, etc.). As the scheduler has the most and best information (which it uses to make the scheduling/allocation decision), this lazy claim creation, pushed down into the scheduler, is better (IMO) than shifting it earlier, where it may not be necessary.

The scheduler does create Binding objects today: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/binding-v1/ It is also a common pattern for objects (e.g. a ReplicaSet) to be created automatically by some controller (the Deployment controller).

With all that said, IIUC, the alternative (resource nomination) is also in active discussion. IMO, we should wait for it to finalize and mature. We could come back and reevaluate it before this KEP goes to Beta; at that time, I hope we have more information on both (resource nomination and DRA extended resources) to make a more informed decision.
I second that. The required API changes in the current KEP revision (pod status to record the final mapping of extended resources to DRA devices) are very likely to also be needed when moving the creation of the ResourceClaim, so we are not on a wrong path. Moving the creation may need additional API changes, but we can discuss those when needed.
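As a rough illustration of the kind of pod status addition being discussed, a sketch of how the final mapping might be recorded follows; the field names are assumptions for illustration, and the KEP text defines the actual API:

```yaml
# Hypothetical pod status fragment recording how extended resources were
# satisfied by DRA. Field names are illustrative; see the KEP for the
# actual API.
status:
  extendedResourceClaimStatus:
    resourceClaimName: my-pod-extended-resources  # the scheduler-created claim
    requestMappings:
    - containerName: app
      resourceName: example.com/gpu  # extended resource from the pod spec
      requestName: gpu               # generated request in the claim
```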
Let's not mix in the binding here - a binding isn't really an object itself - we don't store any bindings in etcd or anything like that. What it translates to is setting the
Right, but the necessary synchronization between different calls from a given component is something we should keep in mind. It may visibly affect performance.
I guess I personally buy this. I agree that no matter who creates that claim, it will have to be recorded in the pod status, and the existing API doesn't allow for that (in theory we could relax it so that statuses may contain something from outside of the spec, but that doesn't sound like the best option). So we need a new field to reflect that in the status - which is part of this proposal. But I'm not a decision maker here - so we'll see what @dom4ha will say.
We want to unify the calls - and this KEP may actually introduce a dependency between different calls. It may not be needed for this KEP, but may have consequences on how/what can be done there.
It's more than that, as all api calls will be abstracted to some

Object creation and update have some subtle differences even for in-memory representation updates (in Reserve). It does not matter now, but will matter in the future once we develop combinatory algorithms which allocate and deallocate DRA resources all the time (insert vs update operation). But both options are doable, so I can't say that there are strong arguments for one option vs the other. I'm rather trying to discuss what is better in terms of a more straightforward approach with less hidden logic.

In my mind, the main problem the KEP needs to address is the case when 100% of nodes are DRA (so the ResourceClaim will always be created), but users keep using the simpler extended resources semantics for various reasons (maybe forever). In that case, we don't have to optimize for the case where only half the nodes are DRA, but for how to translate an extended resource into a DRA ResourceClaim (maybe at some point the scheduler won't have the fit plugin anymore). I suspect we will work in the future on making the translation logic configurable and mutable over time. Then the question will be whether the translation logic should be determined (captured into the ResourceClaim) at workload creation time and used consistently for the whole workload lifetime, or reassessed on each rescheduling.

Yes, we are still far from implementing workload-awareness in the scheduler and pod rescheduling, so it's hard to use it as an argument, and I don't mind leaving the proposed approach for alpha in such a case, unless @sanposhiho or @macsko or others have other strong arguments in this discussion.
Thanks for the clarification.
I noticed that the KEP currently only mentions "Enable cluster administrators to transition to DRA gradually at their own pace, possibly one node at a time." under motivation. This is a real problem in practice when you consider large clusters which need to do a live migration. @yliaog: perhaps make this more obvious by adding "Efficiently support mixed clusters where some nodes use device plugins and some nodes use DRA drivers for the same hardware." as a goal?
It's clearly one of the goals (it's listed as one of the three motivations) and I have never suggested not addressing it. However, in my mind the cluster migration takes days/weeks, while transforming the simplified spec into ResourceClaims will be needed for months/years, so I'm asking which of the two use cases should be driving the design options (assuming they should be equivalent and differ mostly in the special ResourceClaim's visibility, in the migration scenario only). So, once a cluster administrator has migrated all the nodes to DRA, shall the scheduler still attempt to allocate the extended resource and even still run the relevant plugin? In the alternative option, the generated

If you think that the migration-period use case has priority (over the user spec migration) and it justifies the cost of introducing a dynamically created ResourceClaim, I accept that, as I still may be missing a good understanding of the priorities.
…e claim, and clarified support for mixed device plugin & DRA
The commit above added criteria for graduation to beta, and also clarified the support for mixed nodes.
@sanposhiho or @macsko do you have any concerns about this KEP?
@sanposhiho or @macsko @dom4ha friendly ping ... |
@mrunalp could you please take a look at this PR? |
Add new KEP for supporting extended resource requests in DRA
DRA: Handle extended resource requests via DRA Driver #5004
kubelet and scheduler for extended resource backed by DRA kubernetes#130653