[Feature] Support suspend for RayJobs using existing clusters (clusterSelector) #4740

@laurafitzgerald

Summary

When spec.suspend is set to true on a RayJob that targets an existing RayCluster via clusterSelector, the suspend flag is silently ignored and the job continues running inside the cluster. This was an intentional edge case in the original suspend implementation (#926), but it creates a confusing user experience and limits the usefulness of suspend beyond Kueue resource management.

Current Behavior

  • RayJob with owned cluster (embedded rayClusterSpec): suspend: true works as expected — the operator deletes the RayCluster, stops pods, frees resources, and sets JobDeploymentStatus to Suspended.
  • RayJob with existing cluster (clusterSelector): suspend: true is set in the spec but nothing happens. The job continues running inside the Ray cluster. The operator cannot delete a cluster it doesn't own, so it has nothing to act on.

From #926:

Edge case: suspend flag is ignored if a RayJob is submitted against an existing RayCluster instance (matched with ClusterSelector) since we can't delete a RayCluster created by somebody else.

Problem

From a Kubernetes UX perspective, if a user sets suspend: true on a resource and sees it reflected in the spec, they expect it to do something. This is especially confusing when:

  1. A UI exposes a "Pause" action that sets the field but the job carries on regardless.
  2. The user wants to pause a job for reasons other than resource management — for example, they notice something wrong mid-run and want to stop, review, adjust parameters, and re-submit.

The original suspend design was built around Kueue's preemption model (freeing cluster resources). But there's a broader end-user use case: pausing/stopping a job running on a shared cluster is a natural workflow that doesn't require tearing down the cluster.

Proposed Solution

When suspend is set to true on a RayJob using clusterSelector, the operator should call the Ray Jobs API to stop the running job. The cluster itself is left untouched, since the operator does not own it.

The Ray Dashboard already exposes a stop endpoint (POST /api/jobs/{job_id}/stop), and the KubeRay operator already has a dashboard HTTP client used for job submission and status polling. The change would add a stop call through the same client.

Suspend (false → true)

In the RayJob controller's reconcile loop, when suspend is detected on a job using an existing cluster:

  1. Call POST /api/jobs/{job_id}/stop via the existing dashboard client
  2. Set JobDeploymentStatus to Suspended
  3. Set JobStatus to STOPPED

Suspend (true → false)

When suspend is lifted:

  1. Re-submit the job via the dashboard API (same as initial submission)
  2. Update Status.JobId with the new job ID
  3. Set JobDeploymentStatus to Running

Note: resume means re-submission — the job starts from scratch unless the application implements its own checkpointing. This is consistent with the existing behavior for owned clusters (where the cluster is recreated and the job is re-submitted).

Scope

This is a contained change touching:

  • rayjob_controller.go — add the stop-via-API path when clusterSelector is used
  • Dashboard HTTP client — add a StopJob method wrapping the existing Ray Jobs API endpoint
  • Status handling — ensure status transitions are consistent with the owned-cluster path

Context
