[RayJob] Support suspend for RayJobs using clusterSelector#4850
[RayJob] Support suspend for RayJobs using clusterSelector#4850DavidAdaRH wants to merge 3 commits into
Conversation
rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 16956f2. Configure here.
| rayJobInstance.Status.Message = "" | ||
| rayJobInstance.Status.Reason = "" | ||
| rayJobInstance.Status.RayJobStatusInfo = rayv1.RayJobStatusInfo{} | ||
| break |
There was a problem hiding this comment.
Retrying state incorrectly transitions to Suspended for clusterSelector
Low Severity
The new clusterSelector path unconditionally sets JobDeploymentStatus = Suspended for both the Suspending and Retrying cases. The existing non-clusterSelector path (lines 430–435) correctly differentiates: Suspending → Suspended, Retrying → New. If a clusterSelector RayJob ever reaches the Retrying state, this code would incorrectly transition it to Suspended instead of New, preventing the retry. Currently unreachable because validation blocks BackoffLimit > 0 for clusterSelector, but this is a latent correctness issue if that validation is ever relaxed.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 16956f2. Configure here.
| return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil | ||
| } | ||
|
|
||
| rayJobInstance.Status.JobStatus = rayv1.JobStatusStopped |
There was a problem hiding this comment.
| rayJobInstance.Status.JobStatus = rayv1.JobStatusStopped | |
| rayJobInstance.Status.JobStatus = rayv1.JobStatusNew |
JobStatus mirrors the Ray dashboard's authoritative state, KubeRay should only assign values copied from GetJobInfo or reset to JobStatusNew.
There was a problem hiding this comment.
In this case though, the clusterSelector path actually calls StopJob on the dashboard API right before this line, so STOPPED reflects the real Ray-level state we just caused. The owned-cluster path uses New because the cluster gets deleted and there's no Ray job state left to mirror. That's why I went with Stopped here — it's still grounded in the dashboard's authoritative state, just set proactively after our own API call rather than from a GetJobInfo poll. Let me know what you think.
| if err != nil { | ||
| return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err | ||
| } | ||
| if err := rayDashboardClient.StopJob(ctx, rayJobInstance.Status.JobId); err != nil { |
There was a problem hiding this comment.
If a Ray job has already reached a terminal state (SUCCEEDED, FAILED, STOPPED) by the time this suspend path runs — which is a realistic race: the user sets suspend: true just as the job finishes and the controller hasn't polled the updated status yet — the Ray Dashboard API will return an error such as "Job is not in a running state". This code treats that as a hard error and requeues, and since Ray keeps completed jobs in its history indefinitely, the same call will fail on every subsequent reconcile. The RayJob will be permanently stuck in Suspending with no way out short of manual intervention.
I'd suggest inspecting the error from StopJob and treating errors that indicate the job is already in a terminal state as a no-op (the job is effectively stopped, so the suspend semantics are satisfied). Alternatively, you could query the job status via the dashboard client before calling StopJob and skip the call if the job is already terminal.
There was a problem hiding this comment.
Thanks for flagging this, Pawel, the race condition you're describing is a real scenario and worth thinking through. I believe that StopJob in dashboard_httpclient.go already accounts for it. When the Ray Dashboard API responds with stopped: false, the client follows up with a GetJobInfo call and checks IsJobTerminal. If the job has already reached a terminal state (STOPPED, SUCCEEDED, FAILED), StopJob returns nil rather than an error — so the suspend transition carries on as normal.
Here's the relevant code:
if !jobStopResp.Stopped {
jobInfo, err := r.GetJobInfo(ctx, jobName)
if err != nil {
return err
}
if !rayv1.IsJobTerminal(jobInfo.JobStatus) {
return fmt.Errorf("failed to stop job: %v", jobInfo)
}
}
return nil
So the RayJob won't get stuck in Suspending in that race — the dashboard client layer already treats it as a no-op.
Emit K8s warning event when StopJob fails during clusterSelector suspend. Adds FailedToStopRayJob event type to match the existing pattern (FailedToCreateRayCluster, FailedToDeleteRayCluster, etc.), making dashboard API failures visible through kubectl describe rayjob. Co-authored-by: Jun-Hao Wan <ken89@kimo.com> Signed-off-by: DavidAdaRH <dadamach@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED


Why are these changes needed?
When a RayJob targets an existing RayCluster via clusterSelector, the suspend operation is currently blocked at validation time with: "The ClusterSelector mode doesn't support the suspend operation". This was an intentional edge case in the original suspend implementation, since the operator cannot delete a cluster it does not own.
However, suspend does not need to mean cluster deletion. For clusterSelector jobs, the operator can stop the running Ray job via the Ray Dashboard API (POST /api/jobs/{job_id}/stop), leaving the cluster intact for other workloads or for resume. This is the same dashboard client the operator already uses for job submission and status polling.
This PR:
No changes are needed to the resume path. The existing Suspended -> New -> re-submission flow handles clusterSelector jobs correctly once JobId is cleared — the controller re-discovers the cluster by name and submits a fresh job.
Related issue number
Closes #4740
Checks
Unit tests
Two new test cases added in rayjob_controller_suspended_test.go:
Validation test updated: the existing test case in validation_test.go that expected an error for clusterSelector + suspend now expects success.
Manual E2E verification (OpenShift / ROSA)
Deployed a custom operator build to an OpenShift cluster and verified the full lifecycle:
Operator log excerpt showing the suspend transition:
"old JobStatus":"RUNNING","new JobStatus":"STOPPED","old JobDeploymentStatus":"Suspending","new JobDeploymentStatus":"Suspended"
Operator log excerpt showing the resume transition:
"old JobStatus":"STOPPED","new JobStatus":"","old JobDeploymentStatus":"Suspended","new JobDeploymentStatus":""