[RayService] Allow updating WorkerGroupSpecs without rolling out new cluster #1734

What will happen if there is an inconsistency between the RayCluster and the RayService's RayClusterSpec? For example, what if a worker group's replica count is set to 0 in RayClusterSpec, but the Ray Autoscaler has already scaled it up to 10? Updating the RayCluster may cause a lot of running Pods to be killed.

We don't do any special handling for this case, so the user-specified replica count will take precedence over the Ray Autoscaler's value.

One alternative would be to never update `replicas` and `workersToDelete` when updating the RayCluster. The downside is that the user can never override `replicas`.

My guess is that the current approach (the first approach) is better, because the user should always have a way to set `replicas`, and this inconsistent case is an edge case, not the common case. But what are your thoughts?

From the users' perspective, it is much better to disregard the users' settings for `replicas` than to delete a Pod that has any running Ray task or actor.

On second thought, if Ray Autoscaling is enabled, the Ray Autoscaler is the only decision maker that deletes Ray Pods after #1253. Hence, users can increase the number of Pods, but can't delete Pods by updating `replicas`. Users can directly update the RayCluster instead. In addition, setting `replicas` for existing worker groups is not common; most RayService users use the Ray Autoscaler as well. If users still need to manually update `replicas` with Ray Autoscaling enabled, the Ray Autoscaler needs to be improved.

We replace almost everything in `currentRayCluster`. Why do we need to get `currentRayCluster`? Do we need any information from it?

My first approach was to not get `currentRayCluster`, but then I got this error when calling `r.Update()`:

`rayclusters.ray.io "rayservice-sample-raycluster-qb8z4" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update`

After some research, it seemed that a common approach is to first get the current object, apply the changes to it, and then call `Update()`. Do you know what the best practice is? Is it better to use `Patch()` here, or is there some third approach?

Got it. This makes sense.

In my understanding, `Patch` isn't protected by the Kubernetes optimistic concurrency model. I don't understand the use case for `Patch`, so we should avoid using it until we understand it more.

This function's coding style is inconsistent: `old_spec` should be `oldSpec`.

Good call, fixed in e62ac34.

The commit still has some snake_case style, e.g. `newSpec_without_new_worker_groups`.

Thanks, fixed in 8b0056b. I wonder if there's a linter that can check for this.