The cluster setup installs and configures the following components:
- Scheduler Plugins
- Kubeflow Training Operator
- KubeRay
- Kueue
- AppWrappers
- Cluster roles and priority classes
- Autopilot
Create `default-priority`, `high-priority`, and `low-priority` priority classes:
kubectl apply -f setup.k8s/mlbatch-priorities.yaml
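For reference, these are ordinary Kubernetes `PriorityClass` objects. A minimal sketch of what one entry in `mlbatch-priorities.yaml` might look like (the `value` shown here is an assumption, not necessarily the shipped setting):

```yaml
# Hypothetical sketch of one entry in mlbatch-priorities.yaml;
# the actual priority values shipped with MLBatch may differ.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority
value: 100              # assumed relative priority value
globalDefault: false
description: "Default priority for MLBatch workloads."
```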
MLBatch configures Kubernetes scheduling to accomplish two objectives:
- Obtaining gang (all or nothing) scheduling for multi-Pod workloads.
- Packing Pods whose GPU request is less than the number of GPUs on a Node to maximize the number of Nodes available for Pods that request all the GPUs on a Node.
The currently recommended way to do this is to install the Coscheduling out-of-tree scheduler plugin and configure the default NodeResourcesFit scheduler plugin to pack in the GPU dimension. Alternatively, you can skip the helm install and patch commands shown below and instead install the experimental Sakkara scheduler plugin (described next).
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
--set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]'
Patch scheduler-plugins pod priorities:
kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s/scheduler-priority-patch.yaml scheduler-plugins-controller
kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s/scheduler-priority-patch.yaml scheduler-plugins-scheduler
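The patch file itself is not reproduced here; a plausible sketch of a JSON patch (in YAML form) that raises the pod priority of both deployments, assuming it targets `priorityClassName`:

```yaml
# Hypothetical sketch of setup.k8s/scheduler-priority-patch.yaml;
# the actual patch shipped with MLBatch may differ.
- op: add
  path: /spec/template/spec/priorityClassName
  value: system-node-critical   # assumed target priority class
```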
Sakkara is an experimental new scheduler plugin with advanced support for topology-aware scheduling.
Install Sakkara as a secondary scheduler:
helm install sakkara-scheduler --namespace sakkara-scheduler --create-namespace mlbatch/sakkara-scheduler
Optionally, create a config map capturing your cluster's topology as described in the Sakkara documentation; this is recommended for production clusters. If the config map is not present, Sakkara will default to a single-level hierarchy containing the Nodes of the cluster.
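To confirm the secondary scheduler came up, list its pods (pod names may vary):

```sh
# Verify the Sakkara scheduler is running in the namespace created above
kubectl get pods -n sakkara-scheduler
```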
Create the `mlbatch-system` namespace:
kubectl create namespace mlbatch-system
Install the Kubeflow Training Operator
If you are using Coscheduling do:
kubectl apply --server-side -k setup.k8s/training-operator/coscheduling
If you are using Sakkara do:
kubectl apply --server-side -k setup.k8s/training-operator/sakkara
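Either way, one way to confirm which gang scheduler the operator was configured with is to inspect its container arguments; the deployment name `training-operator` and the `mlbatch-system` namespace are assumptions based on this setup:

```sh
# Show the training operator's container args, including gang-scheduler-name
kubectl get deployment training-operator -n mlbatch-system \
  -o jsonpath='{.spec.template.spec.containers[0].args}' ; echo
```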
Install the KubeRay Operator
kubectl apply --server-side -k setup.k8s/kuberay
Install Kueue
kubectl apply --server-side -k setup.k8s/kueue
Install the AppWrapper Operator
If you are using Coscheduling do:
kubectl apply --server-side -k setup.k8s/appwrapper/coscheduling
If you are using Sakkara do:
kubectl apply --server-side -k setup.k8s/appwrapper/sakkara
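At this point you can optionally check that the operators installed so far are running, assuming they were all deployed into the `mlbatch-system` namespace as in this setup:

```sh
# List the operator pods installed by the steps above
kubectl get pods -n mlbatch-system
```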
The provided configuration differs from the default configuration of the operators as follows:
- Kubeflow Training Operator:
  - `gang-scheduler-name` is set to either `scheduler-plugins-scheduler` or `sakkara-scheduler`.
- Kueue:
  - the `batch/job` integration is disabled,
  - `manageJobsWithoutQueueName` is enabled and configured via `managedJobsNamespaceSelector` to be scoped to only namespaces that are labeled with `mlbatch-team-namespace=true` (see the example labeling command after this list),
  - `waitForPodsReady` is disabled,
  - the `LendingLimit` feature gate is enabled,
  - `fairSharing` is enabled,
  - the `enableClusterQueueResources` metric is enabled.
- AppWrapper operator:
  - `userRBACAdmissionCheck` is disabled,
  - `schedulerName` is set to `scheduler-plugins-scheduler` or `sakkara-scheduler`,
  - `queueName` is set to `default-queue`.
- Pod priorities, resource requests, and limits have been adjusted.
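For example, to opt a team namespace into Kueue management under this configuration (the namespace name `team-alpha` is a placeholder):

```sh
# Label a team namespace so that Kueue manages its workloads
kubectl label namespace team-alpha mlbatch-team-namespace=true
```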
Helm chart values and customization how-tos can be found in the official documentation. As-is, Autopilot will run on GPU nodes.
- Add the Autopilot Helm repository
helm repo add autopilot https://ibm.github.io/autopilot/
helm repo update
- Install the chart (the command is idempotent). The config file, for customizing the Helm values, is optional.
helm upgrade autopilot autopilot/autopilot --install --namespace=autopilot --create-namespace -f your-config.yml
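To verify the install, check the pods in the `autopilot` namespace created above; they should land on the GPU nodes:

```sh
# Check that the Autopilot pods are running and where they are scheduled
kubectl get pods -n autopilot -o wide
```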
The `ServiceMonitor` object is what enables Prometheus to scrape the metrics produced by Autopilot.
In order for Prometheus to find the right objects, the `ServiceMonitor` needs to be labeled with the Prometheus release name. It is usually `prometheus`, and that is the default added in the Autopilot release.
If that is not the case in your cluster, the correct release label can be found by checking the `ServiceMonitor` of Prometheus itself, or the name of the Prometheus Helm chart.
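For example, one way to inspect the labels on existing `ServiceMonitor` objects and spot the release name (namespaces and object names vary by installation):

```sh
# Show labels on all ServiceMonitors to find the Prometheus release label
kubectl get servicemonitors.monitoring.coreos.com -A --show-labels
```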
Then, Autopilot's `ServiceMonitor` can be labeled with the following command:
kubectl label servicemonitors.monitoring.coreos.com -n autopilot autopilot-metrics-monitor release=<prometheus-release-name> --overwrite
Create Kueue's default flavor:
kubectl apply -f setup.k8s/default-flavor.yaml
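For reference, this is expected to be a plain Kueue `ResourceFlavor`; a minimal sketch of what `setup.k8s/default-flavor.yaml` presumably contains (an assumption based on the flavor name referenced in the slack ClusterQueue below):

```yaml
# Assumed content of setup.k8s/default-flavor.yaml: an empty ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
```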
Create the `mlbatch-edit` role:
kubectl apply -f setup.k8s/mlbatch-edit-role.yaml
Create the designated slack `ClusterQueue`, which will be used to automate minor adjustments to cluster capacity caused by node failures and scheduler maintenance.
kubectl apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 1
      - name: "pods"
        nominalQuota: 100
EOF
Edit the above quantities to adjust the quota to the desired values. Pod counts are optional and can be omitted from the list of covered resources. The `lendingLimit` for each resource will be dynamically adjusted by the MLBatch system to reflect reduced cluster capacity. See QUOTA_MAINTENANCE.md for a detailed discussion of the role of the slack `ClusterQueue`.
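To inspect the quota and any lending limits currently applied to the slack queue, you can query the resource group directly, for example:

```sh
# Show the slack queue's resource entries, including any dynamically set lendingLimit
kubectl get clusterqueue slack-cluster-queue \
  -o jsonpath='{.spec.resourceGroups[0].flavors[0].resources}' ; echo
```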