[RayJob] Support suspend for RayJobs using clusterSelector by DavidAdaRH · Pull Request #4850 · ray-project/kuberay

DavidAdaRH · 2026-05-20T14:09:17Z

Why are these changes needed?

When a RayJob targets an existing RayCluster via clusterSelector, the suspend operation is currently blocked at validation time with: "The ClusterSelector mode doesn't support the suspend operation". This was an intentional edge case in the original suspend implementation, since the operator cannot delete a cluster it does not own.

However, suspend does not need to mean cluster deletion. For clusterSelector jobs, the operator can stop the running Ray job via the Ray Dashboard API (POST /api/jobs/{job_id}/stop), leaving the cluster intact for other workloads or for resume. This is the same dashboard client the operator already uses for job submission and status polling.
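For illustration, here is a minimal sketch of what a stop call against that endpoint could look like using only the Go standard library. It is not the operator's dashboard client: `stopJobURL`, `stopRayJob`, and `stopJobResponse` are hypothetical names, the base URL is assumed to include a scheme, and the handler in `main` is a local stand-in for the Ray Dashboard so the sketch runs without a cluster.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// stopJobURL builds the stop endpoint for a job (hypothetical helper).
// dashboardURL is assumed to include the scheme, e.g. "http://head:8265".
func stopJobURL(dashboardURL, jobID string) string {
	return fmt.Sprintf("%s/api/jobs/%s/stop", dashboardURL, url.PathEscape(jobID))
}

// stopJobResponse holds the "stopped" field of the dashboard's reply.
type stopJobResponse struct {
	Stopped bool `json:"stopped"`
}

// stopRayJob POSTs to the stop endpoint and reports whether Ray accepted the stop.
func stopRayJob(ctx context.Context, client *http.Client, dashboardURL, jobID string) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, stopJobURL(dashboardURL, jobID), nil)
	if err != nil {
		return false, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("stop job %q: unexpected status %s", jobID, resp.Status)
	}
	var body stopJobResponse
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return body.Stopped, nil
}

func main() {
	// Local stand-in for the Ray Dashboard; it always reports a successful stop.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		fmt.Fprint(w, `{"stopped": true}`)
	}))
	defer srv.Close()

	stopped, err := stopRayJob(context.Background(), srv.Client(), srv.URL, "test-suspend-job-bwbdj")
	fmt.Println("stopped:", stopped, "err:", err)
}
```

The cluster itself is never touched here, which is the point of the approach: only the job is stopped.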

This PR:

- Removes the validation guard that blocked clusterSelector + suspend combinations in ValidateRayJobSpec.
- Adds a clusterSelector-specific suspend path in the Suspending/Retrying case block of the RayJob controller. When a clusterSelector RayJob enters the Suspending state, the controller:
  - Calls rayDashboardClient.StopJob() to stop the running Ray job via the dashboard API
  - Deletes the submitter Kubernetes Job (if any)
  - Sets JobStatus to STOPPED and JobDeploymentStatus to Suspended
  - Clears JobId, Message, Reason, and RayJobStatusInfo to prevent stale metadata on resume
  - Preserves DashboardURL and RayClusterName so the controller can re-submit to the same cluster on resume
  - Breaks out before the existing cluster-deletion path (which remains unchanged for owned clusters)
- Handles the cluster-gone case gracefully: if the RayCluster has been externally deleted before or during suspend, the controller logs the absence and completes the suspend transition without getting stuck in Suspending.

No changes are needed to the resume path. The existing Suspended -> New -> re-submission flow handles clusterSelector jobs correctly once JobId is cleared — the controller re-discovers the cluster by name and submits a fresh job.
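The suspend transition described above can be sketched roughly as follows. This is a simplified model, not the controller code: `rayJobStatus` is a hypothetical subset of the status fields named in the description, `jobStopper` stands in for the dashboard client, submitter Job deletion is omitted, and skipping StopJob when JobID is empty is an assumption of this sketch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// jobStopper is a hypothetical stand-in for the operator's dashboard client.
type jobStopper interface {
	StopJob(ctx context.Context, jobID string) error
}

// rayJobStatus is a simplified, hypothetical subset of the RayJob status.
type rayJobStatus struct {
	JobID               string
	JobStatus           string
	JobDeploymentStatus string
	Message             string
	Reason              string
	DashboardURL        string
	RayClusterName      string
}

// suspendOnSelectedCluster stops the Ray job (if the cluster still exists)
// and moves the status to Suspended without touching the cluster.
func suspendOnSelectedCluster(ctx context.Context, st *rayJobStatus, clusterExists bool, dash jobStopper) error {
	if clusterExists && st.JobID != "" {
		if err := dash.StopJob(ctx, st.JobID); err != nil {
			// Leave the status untouched so the caller can requeue and retry.
			return fmt.Errorf("stop job %q: %w", st.JobID, err)
		}
	}
	st.JobStatus = "STOPPED"
	st.JobDeploymentStatus = "Suspended"
	// Clear per-run metadata; DashboardURL and RayClusterName are preserved for resume.
	st.JobID, st.Message, st.Reason = "", "", ""
	return nil
}

// recordingStopper records StopJob calls and can be told to fail.
type recordingStopper struct {
	calls []string
	err   error
}

func (r *recordingStopper) StopJob(_ context.Context, jobID string) error {
	r.calls = append(r.calls, jobID)
	return r.err
}

// demoSuspend runs one suspend pass against the fake dashboard and returns
// the resulting status, the number of StopJob calls, and any error.
func demoSuspend(clusterExists, failStop bool) (rayJobStatus, int, error) {
	st := rayJobStatus{
		JobID: "test-suspend-job-bwbdj", JobStatus: "RUNNING", JobDeploymentStatus: "Suspending",
		DashboardURL: "test-cluster-head-svc:8265", RayClusterName: "test-cluster",
	}
	rec := &recordingStopper{}
	if failStop {
		rec.err = errors.New("dashboard unreachable")
	}
	err := suspendOnSelectedCluster(context.Background(), &st, clusterExists, rec)
	return st, len(rec.calls), err
}

func main() {
	for _, exists := range []bool{true, false} {
		st, calls, err := demoSuspend(exists, false)
		fmt.Printf("clusterExists=%t stopCalls=%d status=%+v err=%v\n", exists, calls, st, err)
	}
	_, _, err := demoSuspend(true, true)
	fmt.Println("stop failure:", err)
}
```

Because JobID ends up empty while the cluster name survives, a later resume can start from a clean status and submit a fresh job to the same cluster.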

Related issue number

Closes #4740

Checks

- I've made sure the tests are passing.
- Testing Strategy
  - Unit tests
  - Manual tests
  - This PR is not tested :(

Unit tests

Two new test cases added in rayjob_controller_suspended_test.go:

- Happy path (cluster exists): RayJob in Suspending state with a live RayCluster. Asserts StopJob is called via the dashboard client, status transitions to Suspended/STOPPED, JobId is cleared, DashboardURL and RayClusterName are preserved.
- Cluster-gone path: RayJob in Suspending state with no RayCluster present. Asserts no panic, dashboardClientFunc is NOT called, status transitions to Suspended/STOPPED.

Validation test updated: the existing test case in validation_test.go that expected an error for clusterSelector + suspend now expects success.
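As a rough picture of the validation change, the sketch below models a spec check in which clusterSelector + suspend is accepted while BackoffLimit > 0 with clusterSelector is still rejected. `rayJobSpec`, `validateSpec`, and the error text are hypothetical and much simpler than the real ValidateRayJobSpec.

```go
package main

import (
	"errors"
	"fmt"
)

// rayJobSpec is a hypothetical, trimmed-down view of the RayJob spec.
type rayJobSpec struct {
	ClusterSelector map[string]string
	Suspend         bool
	BackoffLimit    int32
}

// validateSpec models the post-change behavior: suspend is allowed in
// clusterSelector mode, retries are not.
func validateSpec(spec rayJobSpec) error {
	usesSelector := len(spec.ClusterSelector) > 0
	if usesSelector && spec.BackoffLimit > 0 {
		return errors.New("backoffLimit is not supported with clusterSelector")
	}
	// Before the change, usesSelector && spec.Suspend was rejected at this point.
	// With that guard removed, the combination validates.
	return nil
}

func main() {
	selector := map[string]string{"ray.io/cluster": "test-cluster"}
	cases := []struct {
		name    string
		spec    rayJobSpec
		wantErr bool
	}{
		{"clusterSelector + suspend", rayJobSpec{ClusterSelector: selector, Suspend: true}, false},
		{"clusterSelector + backoffLimit", rayJobSpec{ClusterSelector: selector, BackoffLimit: 1}, true},
		{"no selector + suspend", rayJobSpec{Suspend: true}, false},
	}
	for _, tc := range cases {
		err := validateSpec(tc.spec)
		fmt.Printf("%-32s wantErr=%t gotErr=%t\n", tc.name, tc.wantErr, err != nil)
	}
}
```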

Manual E2E verification (OpenShift / ROSA)

Deployed a custom operator build to an OpenShift cluster and verified the full lifecycle:

1. Created a RayCluster (test-cluster) with head + worker
2. Created a RayJob with clusterSelector: {"ray.io/cluster": "test-cluster"} and a long-running entrypoint
3. Confirmed job was RUNNING (job ID test-suspend-job-bwbdj) in both the RayJob status and the Ray Dashboard
4. Set suspend: true. The operator called StopJob, status transitioned to Suspended/STOPPED, the job ID was cleared, and the cluster remained running
5. Set suspend: false. The operator re-submitted a new job to the same cluster with a fresh job ID (test-suspend-job-997p5), and status transitioned back to Running/RUNNING

Operator log excerpt showing the suspend transition:

"old JobStatus":"RUNNING","new JobStatus":"STOPPED","old JobDeploymentStatus":"Suspending","new JobDeploymentStatus":"Suspended"

Operator log excerpt showing the resume transition:

"old JobStatus":"STOPPED","new JobStatus":"","old JobDeploymentStatus":"Suspended","new JobDeploymentStatus":""

rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 16956f2.

cursor · 2026-05-20T14:15:40Z

+			rayJobInstance.Status.Message = ""
+			rayJobInstance.Status.Reason = ""
+			rayJobInstance.Status.RayJobStatusInfo = rayv1.RayJobStatusInfo{}
+			break


Retrying state incorrectly transitions to Suspended for clusterSelector

Low Severity

The new clusterSelector path unconditionally sets JobDeploymentStatus = Suspended for both the Suspending and Retrying cases. The existing non-clusterSelector path (lines 430–435) correctly differentiates: Suspending → Suspended, Retrying → New. If a clusterSelector RayJob ever reaches the Retrying state, this code would incorrectly transition it to Suspended instead of New, preventing the retry. Currently unreachable because validation blocks BackoffLimit > 0 for clusterSelector, but this is a latent correctness issue if that validation is ever relaxed.

Additional Locations (1)

ray-operator/controllers/ray/rayjob_controller.go#L429-L435
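One way to picture the distinction raised here is a small helper that picks the next deployment status from the current one, so a shared code path cannot send Retrying to Suspended. This is only a sketch of the idea: `nextDeploymentStatus` is a hypothetical function and the statuses are plain strings rather than the rayv1 constants.

```go
package main

import "fmt"

// nextDeploymentStatus returns the status that should follow cleanup:
// Suspending ends in Suspended, Retrying goes back to New.
func nextDeploymentStatus(current string) (string, error) {
	switch current {
	case "Suspending":
		return "Suspended", nil
	case "Retrying":
		return "New", nil
	default:
		return "", fmt.Errorf("unexpected deployment status %q", current)
	}
}

func main() {
	for _, s := range []string{"Suspending", "Retrying", "Running"} {
		next, err := nextDeploymentStatus(s)
		fmt.Printf("%s -> %q (err: %v)\n", s, next, err)
	}
}
```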

win5923 · 2026-05-21T17:37:42Z

+				return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
+			}
+
+			rayJobInstance.Status.JobStatus = rayv1.JobStatusStopped


Suggested change

- rayJobInstance.Status.JobStatus = rayv1.JobStatusStopped
+ rayJobInstance.Status.JobStatus = rayv1.JobStatusNew

JobStatus mirrors the Ray dashboard's authoritative state, so KubeRay should only assign values copied from GetJobInfo or reset it to JobStatusNew.

https://docs.ray.io/en/latest/cluster/running-applications/job-submission/doc/ray.job_submission.JobStatus.html#ray.job_submission.JobStatus

DavidAdaRH · 2026-05-25

In this case, though, the clusterSelector path calls StopJob on the dashboard API right before this line, so STOPPED reflects the real Ray-level state we just caused. The owned-cluster path uses New because the cluster gets deleted and there's no Ray job state left to mirror. That's why I went with Stopped here: it's still grounded in the dashboard's authoritative state, just set proactively after our own API call rather than from a GetJobInfo poll. Let me know what you think.

pawelpaszki · 2026-05-25T07:00:12Z

+					if err != nil {
+						return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
+					}
+					if err := rayDashboardClient.StopJob(ctx, rayJobInstance.Status.JobId); err != nil {


If a Ray job has already reached a terminal state (SUCCEEDED, FAILED, STOPPED) by the time this suspend path runs — which is a realistic race: the user sets suspend: true just as the job finishes and the controller hasn't polled the updated status yet — the Ray Dashboard API will return an error such as "Job is not in a running state". This code treats that as a hard error and requeues, and since Ray keeps completed jobs in its history indefinitely, the same call will fail on every subsequent reconcile. The RayJob will be permanently stuck in Suspending with no way out short of manual intervention.

I'd suggest inspecting the error from StopJob and treating errors that indicate the job is already in a terminal state as a no-op (the job is effectively stopped, so the suspend semantics are satisfied). Alternatively, you could query the job status via the dashboard client before calling StopJob and skip the call if the job is already terminal.
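The second alternative (check the status first, skip the stop if the job is already terminal) could look roughly like the sketch below. The `dashboard` interface, `stopIfActive`, and `fakeDashboard` are hypothetical and omit the context argument for brevity; the terminal set is STOPPED, SUCCEEDED, and FAILED as listed above.

```go
package main

import "fmt"

// dashboard is a hypothetical, minimal view of a dashboard client.
type dashboard interface {
	GetJobStatus(jobID string) (string, error)
	StopJob(jobID string) error
}

// isTerminal reports whether a Ray job status is final.
func isTerminal(status string) bool {
	switch status {
	case "STOPPED", "SUCCEEDED", "FAILED":
		return true
	}
	return false
}

// stopIfActive stops the job only when it is not already terminal.
// It returns true if a stop request was actually sent.
func stopIfActive(d dashboard, jobID string) (bool, error) {
	status, err := d.GetJobStatus(jobID)
	if err != nil {
		return false, err
	}
	if isTerminal(status) {
		return false, nil // already finished: suspend semantics are satisfied
	}
	if err := d.StopJob(jobID); err != nil {
		return false, err
	}
	return true, nil
}

// fakeDashboard is an in-memory stand-in used to exercise the sketch.
type fakeDashboard struct {
	status    string
	stopCalls int
}

func (f *fakeDashboard) GetJobStatus(string) (string, error) { return f.status, nil }

func (f *fakeDashboard) StopJob(string) error {
	f.stopCalls++
	f.status = "STOPPED"
	return nil
}

func main() {
	for _, s := range []string{"RUNNING", "SUCCEEDED"} {
		d := &fakeDashboard{status: s}
		sent, err := stopIfActive(d, "job-1")
		fmt.Printf("initial=%s stopSent=%t stopCalls=%d err=%v\n", s, sent, d.stopCalls, err)
	}
}
```

A pre-check like this narrows the race but does not remove it, since the job can still finish between the status query and the stop call; the stop call itself therefore still has to tolerate an already-terminal job.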

DavidAdaRH · 2026-05-25

Thanks for flagging this, Pawel. The race condition you're describing is a real scenario and worth thinking through. I believe that StopJob in dashboard_httpclient.go already accounts for it. When the Ray Dashboard API responds with stopped: false, the client follows up with a GetJobInfo call and checks IsJobTerminal. If the job has already reached a terminal state (STOPPED, SUCCEEDED, FAILED), StopJob returns nil rather than an error, so the suspend transition carries on as normal.

Here's the relevant code:

    if !jobStopResp.Stopped {
        jobInfo, err := r.GetJobInfo(ctx, jobName)
        if err != nil {
            return err
        }
        if !rayv1.IsJobTerminal(jobInfo.JobStatus) {
            return fmt.Errorf("failed to stop job: %v", jobInfo)
        }
    }
    return nil

So the RayJob won't get stuck in Suspending in that race: the dashboard client layer already treats it as a no-op.

Emit K8s warning event when StopJob fails during clusterSelector suspend

Adds FailedToStopRayJob event type to match the existing pattern (FailedToCreateRayCluster, FailedToDeleteRayCluster, etc.), making dashboard API failures visible through kubectl describe rayjob.

Co-authored-by: Jun-Hao Wan <ken89@kimo.com>
Signed-off-by: DavidAdaRH <dadamach@redhat.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

Support suspend for RayJobs using clusterSelector

16956f2

DavidAdaRH requested review from MortalHappiness, andrewsykim, kevin85421 and rueian as code owners May 20, 2026 14:09

cursor Bot reviewed May 20, 2026

win5923 reviewed May 21, 2026

Comment thread ray-operator/controllers/ray/rayjob_controller.go

win5923 reviewed May 21, 2026

pawelpaszki reviewed May 25, 2026

Comment thread ray-operator/controllers/ray/rayjob_controller_suspended_test.go

DavidAdaRH and others added 2 commits May 25, 2026 11:43

Add test for StopJob error path during clusterSelector suspend

365039e

Co-authored-by: Cursor <cursoragent@cursor.com>

[RayJob] Support suspend for RayJobs using clusterSelector #4850
DavidAdaRH wants to merge 3 commits into ray-project:master from DavidAdaRH:RHOAIENG-60157-upstream


3 participants
