[Feature] Support suspend for RayJobs using existing clusters (clusterSelector)
Summary
When spec.suspend is set to true on a RayJob that targets an existing RayCluster via clusterSelector, the suspend flag is silently ignored and the job continues running inside the cluster. This was an intentional edge case in the original suspend implementation (#926), but it creates a confusing user experience and limits the usefulness of suspend beyond Kueue resource management.
Current Behavior
- RayJob with owned cluster (embedded rayClusterSpec): suspend: true works as expected. The operator deletes the RayCluster, stops pods, frees resources, and sets JobDeploymentStatus to Suspended.
- RayJob with existing cluster (clusterSelector): suspend: true is set in the spec but nothing happens. The job continues running inside the Ray cluster. The operator cannot delete a cluster it doesn't own, so it has nothing to act on.
From #926:
Edge case: suspend flag is ignored if a RayJob is submitted against an existing RayCluster instance (matched with ClusterSelector) since we can't delete a RayCluster created by somebody else.
Problem
From a Kubernetes UX perspective, if a user sets suspend: true on a resource and sees it reflected in the spec, they expect it to do something. This is especially confusing when:
- A UI exposes a "Pause" action that sets the field but the job carries on regardless.
- The user wants to pause a job for reasons other than resource management — for example, they notice something wrong mid-run and want to stop, review, adjust parameters, and re-submit.
The original suspend design was built around Kueue's preemption model (freeing cluster resources). But there's a broader end-user use case: pausing/stopping a job running on a shared cluster is a natural workflow that doesn't require tearing down the cluster.
Proposed Solution
When suspend is set to true on a RayJob using clusterSelector, the operator should call the Ray Jobs API to stop the running job instead of attempting cluster deletion.
The Ray Dashboard already exposes a stop endpoint (POST /api/jobs/{job_id}/stop), and the KubeRay operator already has a dashboard HTTP client used for job submission and status polling. The change would add a stop call through the same client.
Suspend (false → true)
In the RayJob controller's reconcile loop, when suspend is detected on a job using an existing cluster:
- Call POST /api/jobs/{job_id}/stop via the existing dashboard client
- Set JobDeploymentStatus to Suspended
- Set JobStatus to STOPPED
Suspend (true → false)
When suspend is lifted:
- Re-submit the job via the dashboard API (same as initial submission)
- Update Status.JobId with the new job ID
- Set JobDeploymentStatus to Running
Note: resume means re-submission — the job starts from scratch unless the application implements its own checkpointing. This is consistent with the existing behavior for owned clusters (where the cluster is recreated and the job is re-submitted).
Scope
This is a contained change touching:
- rayjob_controller.go: add the stop-via-API path when clusterSelector is used
- Dashboard HTTP client: add a StopJob method wrapping the existing Ray Jobs API endpoint
- Status handling: ensure status transitions are consistent with the owned-cluster path
Context
- #1900
- #1798