[Feature] Support suspend for RayJobs using existing clusters (clusterSelector)

# [Feature] Support suspend for RayJobs using existing clusters (clusterSelector)

## Summary

When `spec.suspend` is set to `true` on a RayJob that targets an existing RayCluster via `clusterSelector`, the suspend flag is silently ignored and the job continues running inside the cluster. This was an intentional edge case in the original suspend implementation (#926), but it creates a confusing user experience and limits the usefulness of suspend beyond Kueue resource management.

## Current Behavior

- **RayJob with owned cluster (embedded `rayClusterSpec`):** `suspend: true` works as expected — the operator deletes the RayCluster, stops pods, frees resources, and sets `JobDeploymentStatus` to `Suspended`.
- **RayJob with existing cluster (`clusterSelector`):** `suspend: true` is set in the spec but nothing happens. The job continues running inside the Ray cluster. The operator cannot delete a cluster it doesn't own, so it has nothing to act on.

From #926:
> Edge case: `suspend` flag is ignored if a RayJob is submitted against an existing `RayCluster` instance (matched with `ClusterSelector`) since we can't delete a `RayCluster` created by somebody else.

## Problem

From a Kubernetes UX perspective, a user who sets `suspend: true` on a resource and sees it reflected in the spec expects it to take effect. A silently ignored field is especially confusing when:

1. A UI exposes a "Pause" action that sets the field but the job carries on regardless.
2. The user wants to pause a job for reasons other than resource management — for example, they notice something wrong mid-run and want to stop, review, adjust parameters, and re-submit.

The original suspend design was built around Kueue's preemption model (freeing cluster resources). There is also a broader end-user use case: stopping a job that runs on a shared cluster is a natural workflow that doesn't require tearing down the cluster.

## Proposed Solution

When `suspend` is set to `true` on a RayJob using `clusterSelector`, the operator should call the Ray Jobs API to stop the running job instead of attempting cluster deletion.

The Ray Dashboard already exposes a stop endpoint (`POST /api/jobs/{job_id}/stop`), and the KubeRay operator already has a dashboard HTTP client used for job submission and status polling. The change would add a stop call through the same client.
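As a rough illustration of the stop call, the sketch below issues `POST /api/jobs/{job_id}/stop` with Go's standard `net/http` package. The helper names `stopJobURL` and `stopJob`, and the `dashboardURL` parameter, are assumptions made for this example; they are not taken from the KubeRay codebase, whose dashboard client may be structured differently.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

// stopJobURL builds the Ray Jobs API stop endpoint for a job.
// Hypothetical helper: dashboardURL is the base address of the Ray
// Dashboard, for example "http://head-svc:8265".
func stopJobURL(dashboardURL, jobID string) string {
	return strings.TrimRight(dashboardURL, "/") + "/api/jobs/" + url.PathEscape(jobID) + "/stop"
}

// stopJob sends POST /api/jobs/{job_id}/stop and treats any non-2xx
// response as an error. Hypothetical helper, not KubeRay's actual client.
func stopJob(ctx context.Context, client *http.Client, dashboardURL, jobID string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, stopJobURL(dashboardURL, jobID), nil)
	if err != nil {
		return fmt.Errorf("build stop request: %w", err)
	}
	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("call stop endpoint: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		// Read a bounded prefix of the body so the error stays small.
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 1024))
		return fmt.Errorf("stop job %q: unexpected status %d: %s", jobID, resp.StatusCode, body)
	}
	return nil
}
```

A caller would pass the reconcile context and the job ID stored in the RayJob status, for example `stopJob(ctx, http.DefaultClient, dashboardURL, jobID)`.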

### Suspend (false → true)

In the RayJob controller's reconcile loop, when suspend is detected on a job using an existing cluster:

1. Call `POST /api/jobs/{job_id}/stop` via the existing dashboard client
2. Set `JobDeploymentStatus` to `Suspended`
3. Set `JobStatus` to `STOPPED`
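The three steps above could be sketched roughly as follows. The `suspendState` struct is a simplified stand-in that carries only the status fields named in this issue, and `stopFunc` and `suspendOnExistingCluster` are hypothetical names; the real controller works on the full RayJob object and its own client types. The sketch updates the status only after the stop call succeeds, so a failed call leaves the job in its previous state and can be retried on the next reconcile.

```go
package main

import "fmt"

// stopFunc is a hypothetical stand-in for the dashboard client's stop call.
type stopFunc func(jobID string) error

// suspendState holds only the status fields discussed in this issue.
// Simplified for illustration; not the real RayJobStatus type.
type suspendState struct {
	JobId               string
	JobStatus           string
	JobDeploymentStatus string
}

// suspendOnExistingCluster stops the running job through the dashboard
// and then records the suspended state. It is a no-op when the job is
// already suspended, so repeated reconciles do not re-send the stop call.
func suspendOnExistingCluster(stop stopFunc, st *suspendState) error {
	if st.JobDeploymentStatus == "Suspended" {
		return nil
	}
	// An empty job ID means nothing was submitted yet, so there is
	// nothing to stop on the cluster.
	if st.JobId != "" {
		if err := stop(st.JobId); err != nil {
			return fmt.Errorf("stop job %q: %w", st.JobId, err)
		}
	}
	st.JobDeploymentStatus = "Suspended"
	st.JobStatus = "STOPPED"
	return nil
}
```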

### Suspend (true → false)

When suspend is lifted:

1. Re-submit the job via the dashboard API (same as initial submission)
2. Update `Status.JobId` with the new job ID
3. Set `JobDeploymentStatus` to `Running`

Note: resume means re-submission — the job starts from scratch unless the application implements its own checkpointing. This is consistent with the existing behavior for owned clusters (where the cluster is recreated and the job is re-submitted).
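A minimal sketch of the resume path, under the same simplifying assumptions: `resumeState` carries only the fields mentioned above, and `submitFunc` and `resumeOnExistingCluster` are hypothetical names standing in for the existing submission logic. The function overwrites the old job ID, which reflects the point that resume is a fresh submission and not a continuation of the stopped job.

```go
package main

import (
	"errors"
	"fmt"
)

// submitFunc is a hypothetical stand-in for the existing submission call.
// It returns the ID of the newly submitted job.
type submitFunc func(entrypoint string) (string, error)

// resumeState holds only the status fields discussed in this issue.
// Simplified for illustration; not the real RayJobStatus type.
type resumeState struct {
	JobId               string
	JobDeploymentStatus string
}

// resumeOnExistingCluster re-submits a suspended job and records the new
// job ID. Jobs that are not suspended are left untouched.
func resumeOnExistingCluster(submit submitFunc, entrypoint string, st *resumeState) error {
	if st.JobDeploymentStatus != "Suspended" {
		return nil
	}
	newID, err := submit(entrypoint)
	if err != nil {
		return fmt.Errorf("re-submit job: %w", err)
	}
	if newID == "" {
		return errors.New("re-submit job: empty job ID returned")
	}
	// The previous job ID refers to the stopped run and is replaced.
	st.JobId = newID
	st.JobDeploymentStatus = "Running"
	return nil
}
```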

### Scope

This is a contained change touching:
- `rayjob_controller.go` — add the stop-via-API path when `clusterSelector` is used
- Dashboard HTTP client — add a `StopJob` method wrapping the existing Ray Jobs API endpoint
- Status handling — ensure status transitions are consistent with the owned-cluster path

## Context

- Original suspend implementation: #926
- Suspend documentation: #1900
- Suspend as atomic operation: #1798
- Resume suspended RayJob: #1783

