
feat: add RayCluster.status.readyWorkerReplicas #1930

Merged: 1 commit into ray-project:master on May 7, 2024

Conversation

@davidxia (Contributor) commented on Feb 19, 2024:

A worker Pod is ready if it has a PodCondition with type == Ready and status == True. See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions
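
For illustration, a minimal Go sketch of that readiness check; the helper name isPodReady is hypothetical, corev1 refers to k8s.io/api/core/v1, and this is not the exact code in this PR:

// Sketch only: a Pod counts as ready when its PodReady condition reports
// status True, per the Pod lifecycle documentation linked above.
func isPodReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}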

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@davidxia changed the title from "Patch7" to "fix: change RayCluster .status.state to unhealthy if not all Pods running" on Feb 19, 2024
@davidxia davidxia marked this pull request as ready for review February 19, 2024 19:32
@davidxia (Contributor, Author) commented:

cc @kevin85421

@astefanutti (Contributor) left a comment:


Thanks. I think this is also useful when the RayCluster is in the suspended state; in that case it can take a long time during which the cluster still reports a suspended state even though it has been unsuspended.

@kevin85421 (Member) commented:

I will hold off on reviewing this PR for now because we are approaching the release of KubeRay v1.1.0, and State is an important part of RayJob. We can revisit this PR after the release to avoid introducing any potential stability issues into the upcoming release.

@kevin85421 kevin85421 self-assigned this Feb 22, 2024
@kevin85421 kevin85421 self-requested a review February 22, 2024 18:55
@kevin85421 (Member) left a comment:


I prefer not to set the state to rayv1.Unhealthy if not all Pods are running. Both @andrewsykim and @Yicheng-Lu-llll's comments make sense to me.

You can think of RayCluster as a collection of K8s ReplicaSets. In this case, only the suspended ClusterState makes sense to me.

  • Failed: I may remove it in the future.
  • Unhealthy: I may remove it in the future. The ReplicaSetStatus doesn't check the Pod commands, which belong to the data-plane logic. I am also considering removing ValidateHeadRayStartParams and offloading the validation to Ray instead.
  • Ready: In the Kubernetes context, Ready means that the K8s resource is ready to serve requests. However, in the KubeRay context, we require two different definitions of Ready: (1) whether the head Pod is ready for RayService to create Ray Serve applications or to check the job/serve app status, and (2) whether all Ray Pods are ready for RayJob to submit the job.
    • This PR can remove Ready from the ClusterState and then (1) add a field in HeadInfo to indicate whether the head Pod is ready, and (2) add a new field (ReadyWorkerReplicas int32) in RayClusterStatus, similar to ReadyReplicas in ReplicaSetStatus.

@kevin85421 (Member) commented:

> This PR can remove Ready from the ClusterState, and then (1) add a field in HeadInfo to indicate whether the head Pod is ready or not. (2) add a new field (ReadyWorkerReplicas int32) in RayClusterStatus which is similar to ReadyReplicas in ReplicaSetStatus.

After rethinking this, we can narrow down the scope of this PR a bit to accelerate the merging process. We can add only ReadyWorkerReplicas int32 in this PR, and remove Ready and add a new field in HeadInfo in the follow-up PRs.
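
For reference, a minimal sketch of what the narrowed-down API change might look like; the field name comes from this thread, while the surrounding fields and comments are illustrative context rather than the exact KubeRay type:

// Illustrative sketch of RayClusterStatus with the proposed field; only
// ReadyWorkerReplicas is the addition discussed in this PR.
type RayClusterStatus struct {
	// DesiredWorkerReplicas is the number of worker Pods the cluster wants.
	DesiredWorkerReplicas int32 `json:"desiredWorkerReplicas,omitempty"`
	// AvailableWorkerReplicas counts worker Pods that are running.
	AvailableWorkerReplicas int32 `json:"availableWorkerReplicas,omitempty"`
	// ReadyWorkerReplicas counts worker Pods whose PodReady condition is True,
	// mirroring ReadyReplicas in ReplicaSetStatus.
	ReadyWorkerReplicas int32 `json:"readyWorkerReplicas,omitempty"`
	// Other existing status fields are omitted from this sketch.
}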

@kevin85421 (Member) commented:

Narrowing down the scope of this PR to adding ReadyWorkerReplicas int32. I have already started working on some of the items mentioned in #1930 (review) (e.g. #2068).

@andrewsykim (Collaborator) commented:

> We can add only ReadyWorkerReplicas int32 in this PR, and remove Ready and add a new field in HeadInfo in the follow-up PRs.

Is it safe to remove the Ready state? I imagine a lot of people already depend on this status, and removing it would break compatibility.

@kevin85421 (Member) commented:

> Is it safe to remove the Ready state? I imagine a lot of people depend on this status already and it would break compatibility.

Good point. I suspect that the RayCluster status is rarely used by users because the ClusterState is not well-defined, yet almost no users have complained about it.

@andrewsykim (Collaborator) commented on Apr 7, 2024:

> I suspect that the RayCluster status is rarely used by users because the ClusterState is not well-defined, yet almost no users have complained about it.

I agree that it's not well defined, but I don't think that means users are not using it.

The reality is that many users will not open issues or complain about something even if they use it. @davidxia opening this PR to change the behavior is one signal that users already depend on this field. If I had to guess, there's probably a lot of non-public automation that creates a RayCluster and then waits for the .status.state to be ready. I personally think it would be quite disruptive to users to remove this state.

@kevin85421 (Member) commented:

> The reality is that many users will not open issues or complain about something even if they use it. @davidxia opening this PR to change the behavior is one signal that users depend on this field already. If I had to guess, there's probably a lot of (not public) automation that creates a RayCluster and then waits for the .status.state to be ready. I personally think it would be quite disruptive to users to remove this state.

Your point makes sense to me. I am currently considering the tradeoff:

  • Option 1: Change the definition of ready (e.g. this PR)
    • This PR changes the definition of ready, which is also backward incompatible. Some automation processes may pass, but others may fail and be hard to debug because this change could represent an "unknown unknown" for users.
  • Option 2: Remove ready and adopt the solution mentioned in #1930 (review) instead.
    • All related automation processes will fail, but it is easy for users and us to troubleshoot.

We can also wait for one or two months to see if users report any issues with the breaking changes in RayJob status. In KubeRay v1.1.0, almost all RayJob statuses have been redefined. If only a few users raise concerns about this, we may be able to infer that using CR status is not common among users.

@andrewsykim (Collaborator) commented on Apr 8, 2024:

> This PR changes the definition of ready, which is also backward incompatible. Some automation processes may pass, but others may fail and be hard to debug because this change could represent an "unknown unknown" for users.

Good point that this is also considered a breaking change.

I think the technically correct change would be to leave the meaning of .status.state = ready unchanged and add new fields in HeadInfo that reflect the new state we want to expose. However, I consider this PR to be fixing a major bug in the ready definition. A RayCluster that previously had all Pods running but now no longer has any Pods should definitely not be considered ready. Most users that depend on the ready state probably expected it to cover this scenario anyway, so I think changing the current definition of .status.state = ready is actually okay in this particular case. I think it's unlikely to break automation significantly, and it may catch issues that the old behavior did not capture.

@kevin85421 (Member) commented:

> Most users that depend on the ready state probably expected it to cover this scenario anyways, so I think changing the current definition of .status.state = ready is actually okay to do in this particular case.

I would say this is a breaking change. RayJob before KubeRay v1.1.0 will have a lot of issues with this PR's definition of ready, which is why I don't want to merge this PR before the release. The definition of ready is confusing even to some KubeRay developers, who should have a deeper understanding of it for the benefit of users. KubeRay accumulated a lot of technical debt in its early stages, and the status of RayCluster is one issue that we must fix.

  • We can update HeadInfo first.
  • Wait for one month to gather user feedback on RayJob v1.1.0. This will give us a better understanding of whether the status is important to users.
  • I will remove unhealthy.

@davidxia changed the title from "fix: change RayCluster .status.state to unhealthy if not all Pods running" to "feat: add RayCluster.status.readyWorkerReplicas" on Apr 29, 2024
@davidxia (Contributor, Author) commented:

@kevin85421 I updated this PR to only add RayCluster.status.readyWorkerReplicas. Let me know if it looks good.

@@ -1222,6 +1231,7 @@ func (r *RayClusterReconciler) calculateStatus(ctx context.Context, instance *ra
return nil, err
}

newInstance.Status.ReadyWorkerReplicas = utils.CalculateReadyReplicas(runtimePods)
Review comment (Collaborator):


Just for my own understanding, we can't use the existing AvailableWorkerReplicas field because that only counts "running" pods and not necessarily "ready" pods, is that correct?

Reply (Contributor, Author):


yes

oldStatus.AvailableWorkerReplicas != newStatus.AvailableWorkerReplicas ||
oldStatus.DesiredWorkerReplicas != newStatus.DesiredWorkerReplicas ||
oldStatus.MinWorkerReplicas != newStatus.MinWorkerReplicas ||
oldStatus.MaxWorkerReplicas != newStatus.MaxWorkerReplicas {
logger.Info("inconsistentRayClusterStatus", "detect inconsistency", fmt.Sprintf(
Review comment (Member):


Nit: KubeRay introduced structured logging in KubeRay v1.1.0, which logs messages in JSON format. Hence, we are trying to reduce the use of fmt.Sprintf and use key-value pairs instead. For example,

logger.Info("inconsistentRayClusterStatus", "detect inconsistency", $KEY1, $VAL1, $KEY2, $VAL2, ...)

so that logging tools can extract the information from the JSON easily. This is outside the scope of this PR, so it is not necessary to update it here.
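
As a sketch of that suggestion applied to the snippet above (the key names are illustrative; the exact fields logged may differ):

// Sketch only: emit the old and new values as key-value pairs instead of a
// pre-formatted string, so JSON logging tools can index them directly.
logger.Info("inconsistentRayClusterStatus",
	"oldAvailableWorkerReplicas", oldStatus.AvailableWorkerReplicas,
	"newAvailableWorkerReplicas", newStatus.AvailableWorkerReplicas,
	"oldReadyWorkerReplicas", oldStatus.ReadyWorkerReplicas,
	"newReadyWorkerReplicas", newStatus.ReadyWorkerReplicas,
)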

func CalculateReadyReplicas(pods corev1.PodList) int32 {
count := int32(0)
for _, pod := range pods.Items {
if val, ok := pod.Labels["ray.io/node-type"]; !ok || val != string(rayv1.WorkerNode) {
Review comment (Member):


Use RayNodeTypeLabelKey instead?

if val, ok := pod.Labels["ray.io/node-type"]; !ok || val != string(rayv1.WorkerNode) {
continue
}
for _, cond := range pod.Status.Conditions {
Review comment (Member):


Use IsRunningAndReady instead?

func IsRunningAndReady(pod *corev1.Pod) bool {
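
Putting both review suggestions together, a minimal sketch of what the helper could look like; RayNodeTypeLabelKey and IsRunningAndReady are the names referenced in the comments above and are assumed to live in the same utils package, and the merged code may differ:

// Sketch only: count worker Pods that are running and ready, using the shared
// label-key constant and readiness helper suggested in the review.
func CalculateReadyReplicas(pods corev1.PodList) int32 {
	count := int32(0)
	for _, pod := range pods.Items {
		// Skip anything that is not a worker Pod.
		if pod.Labels[RayNodeTypeLabelKey] != string(rayv1.WorkerNode) {
			continue
		}
		if IsRunningAndReady(&pod) {
			count++
		}
	}
	return count
}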

@kevin85421 (Member) left a comment.


@kevin85421 kevin85421 merged commit 7b3adca into ray-project:master May 7, 2024
24 checks passed
@kevin85421 (Member) commented:

@MortalHappiness will take #1930 (review).
