
feat: add RayCluster.status.readyWorkerReplicas #1930

Merged: 1 commit into ray-project:master on May 7, 2024

Conversation

@davidxia (Contributor) commented on Feb 19, 2024:

A worker Pod is ready if it has a PodCondition with type == Ready and status == True. See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions
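
For illustration, a minimal Go sketch of that readiness check; the helper name isPodReady is hypothetical, corev1 refers to k8s.io/api/core/v1, and this is not the exact code in this PR:

// Sketch only: a Pod counts as ready when its PodReady condition reports
// status True, per the Pod lifecycle documentation linked above.
func isPodReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}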

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@davidxia changed the title from "Patch7" to "fix: change RayCluster .status.state to unhealthy if not all Pods running" on Feb 19, 2024
@davidxia davidxia marked this pull request as ready for review February 19, 2024 19:32
@davidxia (Contributor, Author) commented:

cc @kevin85421

@astefanutti (Contributor) left a comment:


Thanks. I think this is also useful when the RayCluster is in the suspended state; in that case it can take a long time during which the cluster still reports a suspended state even though it has been unsuspended.

@kevin85421 (Member) commented:

I will hold off on reviewing this PR for now because we are approaching the release of KubeRay v1.1.0, and State is an important part of RayJob. We can revisit this PR after the release to avoid introducing any potential stability issues into the upcoming release.

@kevin85421 kevin85421 self-assigned this Feb 22, 2024
@kevin85421 kevin85421 self-requested a review February 22, 2024 18:55
@kevin85421 (Member) left a comment:


I prefer not to set the state to rayv1.Unhealthy if not all Pods are running. Both @andrewsykim and @Yicheng-Lu-llll's comments make sense to me.

You can think of RayCluster as a collection of K8s ReplicaSets. In this case, only the suspended ClusterState makes sense to me.

  • Failed: I may remove it in the future.
  • Unhealthy: I may remove it in the future. The ReplicaSetStatus doesn't check the Pod commands, which belong to the data-plane logic. I am also considering removing ValidateHeadRayStartParams and offloading the validation to Ray instead.
  • Ready: In the Kubernetes context, Ready means that the K8s resource is ready to serve requests. However, in the KubeRay context, we require two different definitions of Ready: (1) whether the head Pod is ready for RayService to create Ray Serve applications or to check the job/serve app status, and (2) whether all Ray Pods are ready for RayJob to submit the job.
    • This PR can remove Ready from the ClusterState and then (1) add a field in HeadInfo to indicate whether the head Pod is ready, and (2) add a new field (ReadyWorkerReplicas int32) in RayClusterStatus, similar to ReadyReplicas in ReplicaSetStatus.

@kevin85421 (Member) commented:

> This PR can remove Ready from the ClusterState, and then (1) add a field in HeadInfo to indicate whether the head Pod is ready or not. (2) add a new field (ReadyWorkerReplicas int32) in RayClusterStatus which is similar to ReadyReplicas in ReplicaSetStatus.

After rethinking this, we can narrow down the scope of this PR a bit to accelerate the merging process. We can add only ReadyWorkerReplicas int32 in this PR, and remove Ready and add a new field in HeadInfo in the follow-up PRs.
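
For reference, a minimal sketch of what the narrowed-down API change might look like; the field name comes from this thread, while the surrounding fields and comments are illustrative context rather than the exact KubeRay type:

// Illustrative sketch of RayClusterStatus with the proposed field; only
// ReadyWorkerReplicas is the addition discussed in this PR.
type RayClusterStatus struct {
	// DesiredWorkerReplicas is the number of worker Pods the cluster wants.
	DesiredWorkerReplicas int32 `json:"desiredWorkerReplicas,omitempty"`
	// AvailableWorkerReplicas counts worker Pods that are running.
	AvailableWorkerReplicas int32 `json:"availableWorkerReplicas,omitempty"`
	// ReadyWorkerReplicas counts worker Pods whose PodReady condition is True,
	// mirroring ReadyReplicas in ReplicaSetStatus.
	ReadyWorkerReplicas int32 `json:"readyWorkerReplicas,omitempty"`
	// Other existing status fields are omitted from this sketch.
}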

@kevin85421 (Member) commented:

Narrowing down the scope of this PR to adding ReadyWorkerReplicas int32. I have already started working on some of the items mentioned in #1930 (review) (e.g. #2068).

@andrewsykim (Collaborator) commented:

> We can add only ReadyWorkerReplicas int32 in this PR, and remove Ready and add a new field in HeadInfo in the follow-up PRs.

Is it safe to remove the Ready state? I imagine a lot of people already depend on this status, and removing it would break compatibility.

@kevin85421 (Member) commented:

> Is it safe to remove the Ready state? I imagine a lot of people depend on this status already and it would break compatibility.

Good point. I suspect that the RayCluster status is rarely used by users because the ClusterState is not well-defined, yet almost no users have complained about it.

@andrewsykim (Collaborator) commented on Apr 7, 2024:

> I suspect that the RayCluster status is rarely used by users because the ClusterState is not well-defined, yet almost no users have complained about it.

I agree that it's not well defined, but I don't think that means users are not using it.

The reality is that many users will not open issues or complain about something even if they use it. @davidxia opening this PR to change the behavior is one signal that users already depend on this field. If I had to guess, there's probably a lot of non-public automation that creates a RayCluster and then waits for the .status.state to be ready. I personally think it would be quite disruptive to users to remove this state.

@kevin85421 (Member) commented:

> The reality is that many users will not open issues or complain about something even if they use it. @davidxia opening this PR to change the behavior is one signal that users depend on this field already. If I had to guess, there's probably a lot of (not public) automation that creates a RayCluster and then waits for the .status.state to be ready. I personally think it would be quite disruptive to users to remove this state.

Your point makes sense to me. I am currently considering the tradeoff:

  • Option 1: Change the definition of ready (e.g. this PR)
    • This PR changes the definition of ready, which is also backward incompatible. Some automation processes may pass, but others may fail and be hard to debug because this change could represent an "unknown unknown" for users.
  • Option 2: Remove ready and adopt the solution mentioned in #1930 (review) instead.
    • All related automation processes will fail, but it is easy for users and us to troubleshoot.

We can also wait for one or two months to see if users report any issues with the breaking changes in RayJob status. In KubeRay v1.1.0, almost all RayJob statuses have been redefined. If only a few users raise concerns about this, we may be able to infer that using CR status is not common among users.

@andrewsykim (Collaborator) commented on Apr 8, 2024:

> This PR changes the definition of ready, which is also backward incompatible. Some automation processes may pass, but others may fail and be hard to debug because this change could represent an "unknown unknown" for users.

Good point that this is also considered a breaking change.

I think the technically correct change would be to leave the meaning of .status.state = ready unchanged and add new fields in HeadInfo that reflect the new state we want to expose. However, I consider this PR to be fixing a major bug in the ready definition. A RayCluster that previously had all Pods running but now no longer has any Pods should definitely not be considered ready. Most users that depend on the ready state probably expected it to cover this scenario anyway, so I think changing the current definition of .status.state = ready is actually okay in this particular case. I think it's unlikely to break automation significantly, and it may catch issues that the old behavior did not capture.

@kevin85421 (Member) commented:

> Most users that depend on the ready state probably expected it to cover this scenario anyways, so I think changing the current definition of .status.state = ready is actually okay to do in this particular case.

I would say this is a breaking change. RayJob before KubeRay v1.1.0 will have a lot of issues with this PR's definition of ready, which is why I don't want to merge this PR before the release. The definition of ready is confusing even to some KubeRay developers, who should have a deeper understanding of it for the benefit of users. KubeRay accumulated a lot of technical debt in its early stages, and the status of RayCluster is one issue that we must fix.

  • We can update HeadInfo first.
  • Wait for one month to gather user feedback on RayJob v1.1.0. This will give us a better understanding of whether the status is important to users.
  • I will remove unhealthy.

@davidxia changed the title from "fix: change RayCluster .status.state to unhealthy if not all Pods running" to "feat: add RayCluster.status.readyWorkerReplicas" on Apr 29, 2024
@davidxia (Contributor, Author) commented:

@kevin85421 I updated this PR to only add RayCluster.status.readyWorkerReplicas. Let me know if it looks good.

@@ -1222,6 +1231,7 @@ func (r *RayClusterReconciler) calculateStatus(ctx context.Context, instance *ra
return nil, err
}

newInstance.Status.ReadyWorkerReplicas = utils.CalculateReadyReplicas(runtimePods)
Review comment (Collaborator):


Just for my own understanding, we can't use the existing AvailableWorkerReplicas field because that only counts "running" pods and not necessarily "ready" pods, is that correct?

Reply (Contributor, Author):


yes

oldStatus.AvailableWorkerReplicas != newStatus.AvailableWorkerReplicas ||
oldStatus.DesiredWorkerReplicas != newStatus.DesiredWorkerReplicas ||
oldStatus.MinWorkerReplicas != newStatus.MinWorkerReplicas ||
oldStatus.MaxWorkerReplicas != newStatus.MaxWorkerReplicas {
logger.Info("inconsistentRayClusterStatus", "detect inconsistency", fmt.Sprintf(
Review comment (Member):


Nit: KubeRay introduced structured logging in KubeRay v1.1.0, which logs messages in JSON format. Hence, we are trying to reduce the use of fmt.Sprintf and use key-value pairs instead. For example,

logger.Info("inconsistentRayClusterStatus", "detect inconsistency", $KEY1, $VAL1, $KEY2, $VAL2, ...)

so that logging tools can extract the information from the JSON easily. This is outside the scope of this PR, so it is not necessary to update it here.
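
As a sketch of that suggestion applied to the snippet above (the key names are illustrative; the exact fields logged may differ):

// Sketch only: emit the old and new values as key-value pairs instead of a
// pre-formatted string, so JSON logging tools can index them directly.
logger.Info("inconsistentRayClusterStatus",
	"oldAvailableWorkerReplicas", oldStatus.AvailableWorkerReplicas,
	"newAvailableWorkerReplicas", newStatus.AvailableWorkerReplicas,
	"oldReadyWorkerReplicas", oldStatus.ReadyWorkerReplicas,
	"newReadyWorkerReplicas", newStatus.ReadyWorkerReplicas,
)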

func CalculateReadyReplicas(pods corev1.PodList) int32 {
count := int32(0)
for _, pod := range pods.Items {
if val, ok := pod.Labels["ray.io/node-type"]; !ok || val != string(rayv1.WorkerNode) {
Review comment (Member):


Use RayNodeTypeLabelKey instead?

if val, ok := pod.Labels["ray.io/node-type"]; !ok || val != string(rayv1.WorkerNode) {
continue
}
for _, cond := range pod.Status.Conditions {
Review comment (Member):


Use IsRunningAndReady instead?

func IsRunningAndReady(pod *corev1.Pod) bool {
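
Putting both review suggestions together, a minimal sketch of what the helper could look like; RayNodeTypeLabelKey and IsRunningAndReady are the names referenced in the comments above and are assumed to live in the same utils package, and the merged code may differ:

// Sketch only: count worker Pods that are running and ready, using the shared
// label-key constant and readiness helper suggested in the review.
func CalculateReadyReplicas(pods corev1.PodList) int32 {
	count := int32(0)
	for _, pod := range pods.Items {
		// Skip anything that is not a worker Pod.
		if pod.Labels[RayNodeTypeLabelKey] != string(rayv1.WorkerNode) {
			continue
		}
		if IsRunningAndReady(&pod) {
			count++
		}
	}
	return count
}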

@kevin85421 (Member) left a comment.


@kevin85421 kevin85421 merged commit 7b3adca into ray-project:master May 7, 2024
24 checks passed
@kevin85421 (Member) commented:

@MortalHappiness will take #1930 (review).
