Labels
kind/feature: Categorizes issue or PR as related to a new feature.
What is a multi-component workload
Kubernetes workloads like Deployment, StatefulSet, and Pod all consist of one component (one pod template) with one or more replicas.
AI training and big-data workloads usually consist of more than one component, and each component might have multiple replicas.
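To make the distinction concrete, here is a minimal sketch in Go. The `Component` and `Workload` types and the `TotalReplicas` helper are hypothetical names chosen for illustration; they are not Karmada or Kubeflow APIs. The sketch only shows that a single-template workload is the special case of a workload with one component.

```go
package main

import "fmt"

// Component is a hypothetical model of one role in a workload:
// a single pod template plus the number of replicas created from it.
type Component struct {
	Name     string
	Replicas int
}

// Workload is a hypothetical model of a workload as a list of components.
type Workload struct {
	Kind       string
	Components []Component
}

// TotalReplicas returns the number of pods the workload creates
// across all of its components.
func TotalReplicas(w Workload) int {
	total := 0
	for _, c := range w.Components {
		total += c.Replicas
	}
	return total
}

func main() {
	// A Deployment has exactly one component (one pod template).
	deployment := Workload{
		Kind:       "Deployment",
		Components: []Component{{Name: "main", Replicas: 3}},
	}
	// A PyTorchJob may have two components, Master and Worker.
	pytorchJob := Workload{
		Kind: "PyTorchJob",
		Components: []Component{
			{Name: "Master", Replicas: 1},
			{Name: "Worker", Replicas: 3},
		},
	}
	for _, w := range []Workload{deployment, pytorchJob} {
		fmt.Printf("%s: %d component(s), %d pod(s)\n",
			w.Kind, len(w.Components), TotalReplicas(w))
	}
}
```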
What would you like to be added:
Provide fine-grained support for workloads which usually consist of two or more components, like:
- Kubeflow Training/Big-data workloads
  - The following APIs are being deprecated since July 2025. See the announcement here.
    - TFJob (TensorFlow Job), which may consist of multiple components, like PS, Worker, Chief, Master, and Evaluator.
    - PyTorchJob, which may consist of two components, like Master and Worker.
    - MXJob, which may consist of multiple components, like Scheduler, Server, Worker, TunerTracker, TunerServer, and Tuner.
    - XGBoostJob, which may consist of two components, like Master and Worker.
    - MPIJob, which may consist of two components, like Launcher and Worker.
    - PaddleJob, which may consist of two components, like Master and Worker.
  - SparkApplication, which may consist of two components, like driver and executor.
  - TrainJob, which may consist of two components, like RuntimeRef (refer to JobSet), and Trainer.
- FlinkDeployment, which may consist of two components, like jobManager and taskManager.
- Volcano Job, which may consist of multiple tasks.
- RayJob, which contains HeadGroup and WorkerGroup.
- etc.
User story
As a user, I want to deploy a PyTorchJob to Karmada, and I expect Karmada to schedule it to exactly one member cluster based on the resources available in the member clusters. (See #4049)
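The user story can be sketched as a small Go program. This is an illustration under stated assumptions, not Karmada's scheduler: the `Resources`, `Component`, and `Cluster` types, the `TotalRequest` and `PickCluster` functions, the cluster names, and all resource figures are hypothetical. It also assumes that comparing the summed request of all components against a cluster's aggregate free capacity is good enough; a real scheduler would additionally need to check that each replica fits on some node.

```go
package main

import "fmt"

// Resources is a hypothetical pair of resource quantities.
type Resources struct {
	CPUMilli  int64 // CPU in millicores
	MemoryMiB int64 // memory in MiB
}

// Component is a hypothetical model of one role (for example Master or Worker).
type Component struct {
	Name       string
	Replicas   int64
	PerReplica Resources
}

// Cluster is a hypothetical summary of a member cluster's free capacity.
type Cluster struct {
	Name      string
	Available Resources
}

// TotalRequest sums replicas * per-replica request over every component.
func TotalRequest(components []Component) Resources {
	var total Resources
	for _, c := range components {
		total.CPUMilli += c.Replicas * c.PerReplica.CPUMilli
		total.MemoryMiB += c.Replicas * c.PerReplica.MemoryMiB
	}
	return total
}

// PickCluster returns the first cluster whose free capacity covers the
// whole workload, so that all components land in exactly one cluster.
func PickCluster(components []Component, clusters []Cluster) (string, bool) {
	need := TotalRequest(components)
	for _, cl := range clusters {
		if cl.Available.CPUMilli >= need.CPUMilli &&
			cl.Available.MemoryMiB >= need.MemoryMiB {
			return cl.Name, true
		}
	}
	return "", false
}

func main() {
	job := []Component{
		{Name: "Master", Replicas: 1, PerReplica: Resources{CPUMilli: 2000, MemoryMiB: 4096}},
		{Name: "Worker", Replicas: 3, PerReplica: Resources{CPUMilli: 4000, MemoryMiB: 8192}},
	}
	clusters := []Cluster{
		// member1 could host the Master alone, but not the whole job.
		{Name: "member1", Available: Resources{CPUMilli: 8000, MemoryMiB: 65536}},
		{Name: "member2", Available: Resources{CPUMilli: 16000, MemoryMiB: 32768}},
	}
	if name, ok := PickCluster(job, clusters); ok {
		fmt.Println("scheduled to", name)
	} else {
		fmt.Println("no single cluster can host the whole job")
	}
}
```

If only the Master template were considered, member1 would look like a valid target even though it cannot host the Workers, which is why the request is summed over every component before comparing.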
Work in Progress
- Proposal part 1: Proposal for multiple pod template support #5085
- Proposal part 2: Refine Multiple Pod Templates Scheduling proposal #6535
- Iteration in release-1.15: [Umbrella] Multi-components workload scheduling - phase I #6641
- Iteration in release-1.16: [Umbrella] Multi-components workload scheduling - phase II #6734
Why is this needed:
- The FederatedResourceQuota feature requires accurate resource usage, which in turn requires accounting for every component of a workload. (see [Feature] FederatedResourceQuota Enhancement - Phase II #6486 (comment))
References
- CRDs (might be needed when performing the tests)
- Volcano Job