Skip to content

[RayJob] Support suspend for RayJobs using clusterSelector#4850

Open
DavidAdaRH wants to merge 3 commits into
ray-project:masterfrom
DavidAdaRH:RHOAIENG-60157-upstream
Open

[RayJob] Support suspend for RayJobs using clusterSelector#4850
DavidAdaRH wants to merge 3 commits into
ray-project:masterfrom
DavidAdaRH:RHOAIENG-60157-upstream

Conversation

@DavidAdaRH
Copy link
Copy Markdown

@DavidAdaRH DavidAdaRH commented May 20, 2026

Why are these changes needed?

When a RayJob targets an existing RayCluster via clusterSelector, the suspend operation is currently blocked at validation time with: "The ClusterSelector mode doesn't support the suspend operation". This was an intentional edge case in the original suspend implementation, since the operator cannot delete a cluster it does not own.

However, suspend does not need to mean cluster deletion. For clusterSelector jobs, the operator can stop the running Ray job via the Ray Dashboard API (POST /api/jobs/{job_id}/stop), leaving the cluster intact for other workloads or for resume. This is the same dashboard client the operator already uses for job submission and status polling.

This PR:

  1. Removes the validation guard that blocked clusterSelector + suspend combinations in ValidateRayJobSpec.
  2. Adds a clusterSelector-specific suspend path in the Suspending/Retrying case block of the RayJob controller. When a clusterSelector RayJob enters the Suspending state, the controller:
  • Calls rayDashboardClient.StopJob() to stop the running Ray job via the dashboard API
  • Deletes the submitter Kubernetes Job (if any)
  • Sets JobStatus to STOPPED and JobDeploymentStatus to Suspended
  • Clears JobId, Message, Reason, and RayJobStatusInfo to prevent stale metadata on resume
  • Preserves DashboardURL and RayClusterName so the controller can re-submit to the same cluster on resume
  • Breaks out before the existing cluster-deletion path (which remains unchanged for owned clusters)
  1. Handles the cluster-gone case gracefully: if the RayCluster has been externally deleted before or during suspend, the controller logs the absence and completes the suspend transition without getting stuck in Suspending.

No changes are needed to the resume path. The existing Suspended -> New -> re-submission flow handles clusterSelector jobs correctly once JobId is cleared — the controller re-discovers the cluster by name and submits a fresh job.

Related issue number

Closes #4740

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Unit tests

Two new test cases added in rayjob_controller_suspended_test.go:

  • Happy path (cluster exists): RayJob in Suspending state with a live RayCluster. Asserts StopJob is called via the dashboard client, status transitions to Suspended/STOPPED, JobId is cleared, DashboardURL and RayClusterName are preserved.
  • Cluster-gone path: RayJob in Suspending state with no RayCluster present. Asserts no panic, dashboardClientFunc is NOT called, status transitions to Suspended/STOPPED.

Validation test updated: the existing test case in validation_test.go that expected an error for clusterSelector + suspend now expects success.

Manual E2E verification (OpenShift / ROSA)

Deployed a custom operator build to an OpenShift cluster and verified the full lifecycle:

  1. Created a RayCluster (test-cluster) with head + worker
  2. Created a RayJob with clusterSelector: {"ray.io/cluster": "test-cluster"} and a long-running entrypoint
  3. Confirmed job was RUNNING (job ID test-suspend-job-bwbdj) in both the RayJob status and the Ray Dashboard
  4. Set suspend: true — operator called StopJob, status transitioned to Suspended/STOPPED, job ID cleared, cluster remained running
  5. Set suspend: false — operator re-submitted a new job to the same cluster with a fresh job ID (test-suspend-job-997p5), status transitioned back to Running/RUNNING

Operator log excerpt showing the suspend transition:

"old JobStatus":"RUNNING","new JobStatus":"STOPPED","old JobDeploymentStatus":"Suspending","new JobDeploymentStatus":"Suspended"

Operator log excerpt showing the resume transition:

"old JobStatus":"STOPPED","new JobStatus":"","old JobDeploymentStatus":"Suspended","new JobDeploymentStatus":""

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 16956f2. Configure here.

rayJobInstance.Status.Message = ""
rayJobInstance.Status.Reason = ""
rayJobInstance.Status.RayJobStatusInfo = rayv1.RayJobStatusInfo{}
break
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrying state incorrectly transitions to Suspended for clusterSelector

Low Severity

The new clusterSelector path unconditionally sets JobDeploymentStatus = Suspended for both the Suspending and Retrying cases. The existing non-clusterSelector path (lines 430–435) correctly differentiates: SuspendingSuspended, RetryingNew. If a clusterSelector RayJob ever reaches the Retrying state, this code would incorrectly transition it to Suspended instead of New, preventing the retry. Currently unreachable because validation blocks BackoffLimit > 0 for clusterSelector, but this is a latent correctness issue if that validation is ever relaxed.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 16956f2. Configure here.

Comment thread ray-operator/controllers/ray/rayjob_controller.go
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
}

rayJobInstance.Status.JobStatus = rayv1.JobStatusStopped
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rayJobInstance.Status.JobStatus = rayv1.JobStatusStopped
rayJobInstance.Status.JobStatus = rayv1.JobStatusNew

JobStatus mirrors the Ray dashboard's authoritative state, KubeRay should only assign values copied from GetJobInfo or reset to JobStatusNew.

https://docs.ray.io/en/latest/cluster/running-applications/job-submission/doc/ray.job_submission.JobStatus.html#ray.job_submission.JobStatus

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case though, the clusterSelector path actually calls StopJob on the dashboard API right before this line, so STOPPED reflects the real Ray-level state we just caused. The owned-cluster path uses New because the cluster gets deleted and there's no Ray job state left to mirror. That's why I went with Stopped here — it's still grounded in the dashboard's authoritative state, just set proactively after our own API call rather than from a GetJobInfo poll. Let me know what you think.

if err != nil {
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
if err := rayDashboardClient.StopJob(ctx, rayJobInstance.Status.JobId); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a Ray job has already reached a terminal state (SUCCEEDED, FAILED, STOPPED) by the time this suspend path runs — which is a realistic race: the user sets suspend: true just as the job finishes and the controller hasn't polled the updated status yet — the Ray Dashboard API will return an error such as "Job is not in a running state". This code treats that as a hard error and requeues, and since Ray keeps completed jobs in its history indefinitely, the same call will fail on every subsequent reconcile. The RayJob will be permanently stuck in Suspending with no way out short of manual intervention.

I'd suggest inspecting the error from StopJob and treating errors that indicate the job is already in a terminal state as a no-op (the job is effectively stopped, so the suspend semantics are satisfied). Alternatively, you could query the job status via the dashboard client before calling StopJob and skip the call if the job is already terminal.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for flagging this, Pawel, the race condition you're describing is a real scenario and worth thinking through. I believe that StopJob in dashboard_httpclient.go already accounts for it. When the Ray Dashboard API responds with stopped: false, the client follows up with a GetJobInfo call and checks IsJobTerminal. If the job has already reached a terminal state (STOPPED, SUCCEEDED, FAILED), StopJob returns nil rather than an error — so the suspend transition carries on as normal.

Here's the relevant code:

if !jobStopResp.Stopped {
    jobInfo, err := r.GetJobInfo(ctx, jobName)
    if err != nil {
        return err
    }
    if !rayv1.IsJobTerminal(jobInfo.JobStatus) {
        return fmt.Errorf("failed to stop job: %v", jobInfo)
    }
}
return nil

So the RayJob won't get stuck in Suspending in that race — the dashboard client layer already treats it as a no-op.

Comment thread ray-operator/controllers/ray/rayjob_controller_suspended_test.go
DavidAdaRH and others added 2 commits May 25, 2026 11:43
Emit K8s warning event when StopJob fails during clusterSelector suspend. Adds FailedToStopRayJob event type to match the existing pattern (FailedToCreateRayCluster, FailedToDeleteRayCluster, etc.), making dashboard API failures visible through kubectl describe rayjob.

Co-authored-by: Jun-Hao Wan <ken89@kimo.com>
Signed-off-by: DavidAdaRH <dadamach@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Support suspend for RayJobs using existing clusters (clusterSelector)

3 participants