[Feature] Multi-components workload scheduling #5115

@RainbowMango

What is a multi-component workload
Kubernetes workloads like Deployment, StatefulSet, and Pod all consist of one component (one pod template) with one or more replicas.
AI training and big-data workloads usually consist of more than one component, and each component may have multiple replicas.

What would you like to be added:
Provide fine-grained support for workloads that usually consist of two or more components, such as:

  • Kubeflow Training/Big-data workloads
    • The following APIs have been in the process of deprecation since July 2025. See the announcement here.
      • TFJob (TensorFlow Job), which may consist of multiple components like PS, Worker, Chief, Master, and Evaluator.
      • PyTorchJob, which may consist of two components, like Master and Worker.
      • MXJob, which may consist of multiple components, like Scheduler, Server, Worker, TunerTracker, TunerServer, and Tuner.
      • XGBoostJob, which may consist of two components, like Master and Worker.
      • MPIJob, which may consist of two components, like Launcher and Worker.
      • PaddleJob, which may consist of two components, like Master and Worker.
    • SparkApplication, which may consist of two components, like driver and executor.
    • TrainJob, which may consist of two components, like RuntimeRef(refer to JobSet), and Trainer.
  • FlinkDeployment, which may consist of two components, like jobManager and taskManager.
  • Volcano Job, which may consist of multiple tasks.
  • RayJob, which contains HeadGroup and WorkerGroup.
  • etc.

User story
As a user, I want to deploy a PyTorchJob to Karmada, and I want Karmada to schedule it to exactly one member cluster based on the resources available in that cluster. (See #4049)
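
To make the user story concrete, here is a minimal sketch (not Karmada's actual scheduler) of choosing a single member cluster for a two-component job. It assumes each cluster reports one aggregated "available" figure per resource and ignores node-level packing. The names `Component`, `total_request`, and `pick_single_cluster`, the resource keys, and the cluster names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical resource map, e.g. {"cpu_milli": 2000, "memory_mib": 4096}.
Resources = Dict[str, int]


@dataclass
class Component:
    """One component of a workload, e.g. the Master or Worker of a PyTorchJob."""
    name: str
    replicas: int
    per_replica: Resources


def total_request(components: List[Component]) -> Resources:
    """Sum replicas * per-replica request over every component."""
    total: Resources = {}
    for component in components:
        for resource, quantity in component.per_replica.items():
            total[resource] = total.get(resource, 0) + quantity * component.replicas
    return total


def pick_single_cluster(
    components: List[Component], available: Dict[str, Resources]
) -> Optional[str]:
    """Return one cluster that can hold the whole workload, or None.

    Assumption: cluster-level aggregated capacity is a good enough fit check.
    Ties are broken by cluster name so the result is deterministic.
    """
    need = total_request(components)
    candidates = [
        cluster
        for cluster, free in available.items()
        if all(free.get(resource, 0) >= quantity for resource, quantity in need.items())
    ]
    return min(candidates) if candidates else None


# Hypothetical PyTorchJob: 1 Master and 3 Workers.
job = [
    Component("Master", 1, {"cpu_milli": 1000, "memory_mib": 2048}),
    Component("Worker", 3, {"cpu_milli": 2000, "memory_mib": 4096}),
]
clusters = {
    "member1": {"cpu_milli": 6000, "memory_mib": 32768},
    "member2": {"cpu_milli": 8000, "memory_mib": 16384},
}
print(pick_single_cluster(job, clusters))  # member2
```

In this example the job needs 7000 millicores in total, so `member1` is rejected even though it has plenty of memory. An estimate based on only one of the two pod templates would not catch that.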

Working Progress

Why is this needed:

  1. The FederatedResourceQuota feature requires accurate resource usage. (See [Feature] FederatedResourceQuota Enhancement - Phase II #6486 (comment).)
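
As an illustration of why per-component accounting matters for quota, the sketch below admits a workload only if its summed usage stays within a hard limit. It is not the FederatedResourceQuota implementation. The helper names, the tuple layout, and the resource keys are assumptions made for this example, and resources absent from the limit are treated as unconstrained.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (component name, replicas, per-replica request).
ComponentSpec = Tuple[str, int, Dict[str, int]]


def workload_usage(components: List[ComponentSpec]) -> Dict[str, int]:
    """Usage of the whole workload: sum of replicas * request per component."""
    usage: Dict[str, int] = {}
    for _name, replicas, request in components:
        for resource, quantity in request.items():
            usage[resource] = usage.get(resource, 0) + replicas * quantity
    return usage


def fits_quota(
    used: Dict[str, int], limit: Dict[str, int], request: Dict[str, int]
) -> bool:
    """True if used + request stays within every limited resource.

    Assumption: a resource that does not appear in `limit` is not constrained.
    """
    return all(
        used.get(resource, 0) + request.get(resource, 0) <= hard
        for resource, hard in limit.items()
    )


# Hypothetical FlinkDeployment: 1 jobManager and 4 taskManagers.
flink = [
    ("jobManager", 1, {"cpu_milli": 1000, "memory_mib": 2048}),
    ("taskManager", 4, {"cpu_milli": 2000, "memory_mib": 4096}),
]
request = workload_usage(flink)
print(request)  # {'cpu_milli': 9000, 'memory_mib': 18432}
print(fits_quota({"cpu_milli": 2000}, {"cpu_milli": 10000}, request))  # False
```

With 2000 millicores already used against a 10000 limit, the 9000 requested by both components together does not fit. Counting only the jobManager template would have admitted the workload incorrectly.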

References

  • CRDs (might be needed when performing the tests)

Labels: kind/feature (categorizes issue or PR as related to a new feature)