
Add Support for Plain Pods with Known OwnerReferences #4106

Open · 3 tasks · Tracked by #4249
BenHinthorne opened this issue Jan 30, 2025 · 12 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@BenHinthorne

What would you like to be added:
I would like to add the ability for pods annotated with kueue.x-k8s.io/standalone to be suspended even if their owner is a known Kueue integration (open to suggestions on the exact annotation wording).

Currently, in the pod_webhook, if a pod has an ownerReference of a Kind known to Kueue, Kueue will not suspend the pod and a workload will never be created for it. I would like to add the ability to bypass this check and allow the pod to be suspended and have a workload created for it, if the user opts in via an annotation.
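
For illustration, a minimal sketch of what an opted-in pod might look like, assuming the proposed kueue.x-k8s.io/standalone annotation name (not final) and hypothetical pod and owner names. Today the webhook would skip this pod because its controller owner is of a Kind (Pod) known to Kueue; with the proposed opt-in it would be suspended and get a workload:

apiVersion: v1
kind: Pod
metadata:
  name: executor-pod                      # hypothetical name, for illustration only
  labels:
    kueue.x-k8s.io/queue-name: queue-1    # existing Kueue label selecting the LocalQueue
  annotations:
    kueue.x-k8s.io/standalone: "true"     # proposed opt-in annotation; exact name still open
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod                             # owner Kind already known to Kueue (a plain Pod)
    name: driver-pod
    uid: xxxxxx
spec:
  containers:
  - name: executor
    image: example.com/executor:latest    # placeholder image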

Why is this needed:

At a high level, this would give users more flexibility to process plain pods with Kueue. For example, consider Spark applications, which don't yet have native support within Kueue. In short, a Spark app operates with a Driver pod, which then creates Executor pods. The Executors' owner is the Driver pod, and is thus of Kind Pod. Currently, the ownerReference check mentioned above prevents the user from having Kueue process the Executors at all. If users could add the standalone annotation to the Executor pods, they would be able to use Kueue to manage Spark apps via plain pods. The same concept could extend to any CRD not yet natively integrated with Kueue that follows a similar pattern of Pods owning Pods.

Similarly, this option would give users the flexibility to process known workload types as plain pods, should they choose to do so. This flexibility would help users process dynamically sized jobs (at least until https://github.com/kubernetes-sigs/kueue/tree/main/keps/77-dynamically-sized-jobs#implementation-history is fully supported), and would give them the choice of whether or not they want gang-scheduling behavior. For example, consider RayJobs. Currently, a RayJob is processed by Kueue as a single workload, and any pods resulting from the job autoscaling aren't yet supported. If users could have pods owned by a RayJob processed as plain pods, they could choose to process RayJobs as individual pods (forgoing gang scheduling) and process any dynamically scaled pods in the same way.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments. I will link an example PR in the comments and wait for feedback before proceeding with any doc updates.

@BenHinthorne BenHinthorne added the kind/feature label on Jan 30, 2025
@BenHinthorne
Author

I created a draft PR for how this would work here: #4109

I have tested this in my local env!

@kannon92
Contributor

I would think a KEP is worth going through on this one.

There is parallel work going on to support SparkApplications from @everpeace. If there is support for Spark, do you need this feature?

@KPostOffice
Contributor

When the owner is a pod, can it ever be the case that we don't want Kueue to manage the child object? I might be missing something here, but if a pod is creating workloads, is it ever truly an owner in the same sense as the other integrations? Could we remove Pods from the list of integrations we care about when checking whether the owner is Kueue-managed?

@everpeace

everpeace commented Jan 31, 2025

There is parallel work going on to support SparkApplications from @everpeace. If there is support for Spark, do you need this feature?

Thanks for pinging. Although it may be a bit of an abuse (I'm not sure, sorry), as far as I know, Kueue already has the kueue.x-k8s.io/pod-suspending-parent annotation. I think the pod below will be suspended as expected by this code (kueue/pkg/controller/jobs/pod/pod_webhook.go, lines 137 to 142 at 198d276):

func (w *PodWebhook) Default(ctx context.Context, obj runtime.Object) error {
	pod := FromObject(obj)
	log := ctrl.LoggerFrom(ctx).WithName("pod-webhook")
	log.V(5).Info("Applying defaults")

	_, suspend := pod.pod.GetAnnotations()[SuspendedByParentAnnotation]

An example pod:

kind: Pod
metadata:
  name: a-pod-owned-by-another-pod
  annotations:
    kueue.x-k8s.io/pod-suspending-parent: pod
  labels:
    kueue.x-k8s.io/queue-name: queue-1
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: a-owner-pod
    uid: xxxxxx
spec:
  ...

For the Spark case, in my kueue.x-k8s.io/pod-group-sets PR (#4102), I tested the pod-group-sets feature locally by putting the kueue.x-k8s.io/pod-suspending-parent annotation on the executor pods as shown below, and it works as expected. I will work on the SparkApplication integration as a next step.

./bin/spark-submit \
    --master k8s://https://127.0.0.1:58900 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
...
    --conf spark.kubernetes.driver.annotation.kueue.x-k8s.io/pod-group-total-count=3 \
    --conf spark.kubernetes.driver.annotation.kueue.x-k8s.io/pod-group-sets='[{"count":1,"name":"driver"},{"count":2,"name":"worker","template":..omit..}]' \
    --conf spark.kubernetes.driver.label.kueue.x-k8s.io/queue-name=queue-1 \
    --conf spark.kubernetes.driver.label.kueue.x-k8s.io/pod-group-name=spark-pi \
    --conf spark.kubernetes.driver.label.kueue.x-k8s.io/pod-group-role=driver \
...
    # THIS ONE!
    --conf spark.kubernetes.executor.annotation.kueue.x-k8s.io/pod-suspending-parent=pod \
    --conf spark.kubernetes.executor.annotation.kueue.x-k8s.io/pod-group-total-count=3 \
    --conf spark.kubernetes.executor.label.kueue.x-k8s.io/queue-name=queue-1 \
    --conf spark.kubernetes.executor.label.kueue.x-k8s.io/pod-group-name=spark-pi \
    --conf spark.kubernetes.executor.label.kueue.x-k8s.io/pod-group-role=executor \
    local:///opt/spark/examples/jars/spark-examples.jar 5000

@BenHinthorne
Author

Thanks for pinging. Although it may be a bit of an abuse (I'm not sure, sorry), as far as I know, Kueue already has the kueue.x-k8s.io/pod-suspending-parent annotation. [...]

Ah interesting, thanks for pointing out the kueue.x-k8s.io/pod-suspending-parent annotation! I also tested this out on my setup and it worked as expected, with the behavior I was hoping to get from the proposed standalone-pod annotation.

My concern with using this annotation is that (as you alluded to) it may not be entirely what it's meant for, so if its implementation changes down the road while we are depending on this behavior, we could run into trouble. I tracked down the original PR where pod-suspending-parent was introduced, but to be honest I am not 100% sure of the reason it was introduced. The PR description says:

"Adjusts default suspension logic in Deployment/StatefulSet webhooks to correctly handle the case where they
are children of another Kueue-managed Kind."

So maybe the annotation was actually built for this purpose: allowing pods that are children of another Kueue-managed Kind to opt into management with the annotation. I'm not 100% sure though; maybe @dgrove-oss (author of the original PR) could weigh in? Is the usage of the kueue.x-k8s.io/pod-suspending-parent annotation valid for the use case described in this issue?

@dgrove-oss
Contributor

So maybe the annotation was actually built for this purpose: allowing pods that are children of another Kueue-managed Kind to opt into management with the annotation.

This is the purpose of the annotation. Currently, I would classify the annotation as an implementation mechanism, not as part of the stable API for Kueue. We do need the functionality, so assuming something like it will be available in future releases is probably relatively low risk, but the details could change. Is that fair @mimowo?

@mimowo
Contributor

mimowo commented Feb 12, 2025

Yes, I think we will need the functionality the annotation provides going forward. Since its original introduction for use in the pod_webhook, we already also depend on it for the LWS and StatefulSet integrations. I think it is safe to assume it is going to stay.

As a related side note, I was never truly convinced about its name, so we may consider changing it :). An alternative name would be kueue.x-k8s.io/pod-managing-framework. If we want to rename it, it is probably better now than later. Wdyt @dgrove-oss?

@mimowo
Contributor

mimowo commented Feb 12, 2025

Still, IIUC the kueue.x-k8s.io/pod-suspending-parent annotation does not solve the problem initially raised in this issue: supporting Spark without a full-blown integration. I think there is actually great value in supporting integrations via Pods, so that users of Kueue can "help themselves" rather than implementing full-blown built-in integrations in Kueue, given the maintenance cost of those integrations.

So, my preference is to support something like kueue.x-k8s.io/pod-standalone and unblock the issue. Given the relative ease of implementation, I'm happy to consider it for 0.11. Also, this is far easier and less risky than https://github.com/kubernetes-sigs/kueue/tree/main/keps/77-dynamically-sized-jobs (which is out of the question for 0.11).

There are some questions:

  1. Should it be generalized to other CRDs? Then we may prefer to call it kueue.x-k8s.io/standalone. I believe the implementation will be simple, but I'm not sure about the use cases. I think some users of JobSet in Kueue will want to mark the individual Jobs as standalone too. So my preference is to make it generic from day one (or at least explore that as an option).

  2. Related to the use case:

RayJobs will be processed by Kueue as a single workload, and any pods resulting from the job autoscaling aren't yet supported.

IIUC this is a problem only if you need to enable the "Ray" integration in your cluster. Otherwise, you could just leave it disabled and add the "queue-name" label at the PodTemplate level. If you do need to enable the "Ray" integration in your cluster, then the way to go would be to add the "queue-name" and "standalone" annotations at the PodTemplate level, roughly as in the sketch below. Is this correct?
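
To make this concrete, here is a rough, hypothetical sketch of RayJob worker pods opting in at the PodTemplate level, assuming the proposed standalone annotation (name not final). The field paths follow KubeRay's RayJob/RayCluster spec as I understand it and should be verified against the CRD version in use; unrelated required fields are omitted:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-rayjob                            # hypothetical name, for illustration only
spec:
  rayClusterSpec:
    workerGroupSpecs:
    - groupName: workers
      replicas: 2
      template:
        metadata:
          labels:
            kueue.x-k8s.io/queue-name: queue-1   # existing Kueue label at the PodTemplate level
          annotations:
            kueue.x-k8s.io/standalone: "true"    # proposed opt-in annotation; exact name still open
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:latest
    # headGroupSpec and other required fields omitted for brevity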

Any opinions @tenzen-y @dgrove-oss? Would you like to drive @BenHinthorne?

@mimowo
Contributor

mimowo commented Feb 12, 2025

In the past we introduced some annotations without a KEP, and this one would be very simple. OTOH, adding a KEP never hurts, so it would be a great place to discuss the remaining open questions.

@dgrove-oss
Contributor

As a related side note, I was never truly convinced about its name, so we may consider changing it :), some alternative names would be kueue.x-k8s.io/pod-managing-framework. If we want to rename it is probably better now then later

Agreed, pod-managing-framework is a better name, and we should either rename it pre-0.11 or live with it.

@mimowo
Contributor

mimowo commented Feb 13, 2025

Yeah, but pod-suspending-parent is already released in 0.10, so I think we would need a transition period of a couple of releases where we support both: pod-suspending-parent would be deprecated but still supported in "reads", while "pod-managing-framework" would be set in "writes".

I'm ok if you are up to the challenge, but I'm also ok to live with it, maybe just improve the documentation.

@dgrove-oss
Contributor

Forgot that we had released it already. I'm fine with improving the docs and living with it.
