RayService supports updating the underlying RayCluster in place when new worker group specs are appended to WorkerGroupSpecs and the existing worker group specs are unchanged except for Replicas and WorkersToDelete, which may have been updated by the Autoscaler.
However, if the Autoscaler has updated the RayCluster since the spec was last reconciled, applying the update will write back the old values of Replicas and WorkersToDelete, which may kill some running Pods before the Autoscaler scales them up again.
One solution is to modify updateRayClusterInstance to keep the old values of Replicas and WorkersToDelete and only update the other fields. Some care may need to be taken to ensure rayClusterAnnotations[utils.HashWithoutReplicasAndWorkersToDeleteKey] is set correctly.
What will happen if there is inconsistency between the RayCluster and RayService's RayClusterSpec? For example, what if a worker group's replica count is set to 0 in RayClusterSpec, but the Ray Autoscaler has already scaled it up to 10? Updating the RayCluster may cause a lot of running Pods to be killed.
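Here is a minimal sketch of the proposed approach, assuming the rayv1 API types from the KubeRay operator; the helper name preserveAutoscalerFields and the surrounding wiring are illustrative only, not the actual updateRayClusterInstance implementation:

```go
package sketch

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// preserveAutoscalerFields copies Replicas and WorkersToDelete from the live
// (autoscaler-managed) RayCluster into the goal spec, so that an in-place
// update does not clobber scaling decisions made since the RayService spec
// was last reconciled. Newly appended worker groups have no live counterpart,
// so they keep the values from the RayService spec.
func preserveAutoscalerFields(goal, live *rayv1.RayCluster) {
	liveGroups := make(map[string]rayv1.WorkerGroupSpec, len(live.Spec.WorkerGroupSpecs))
	for _, g := range live.Spec.WorkerGroupSpecs {
		liveGroups[g.GroupName] = g
	}
	for i, g := range goal.Spec.WorkerGroupSpecs {
		if liveGroup, ok := liveGroups[g.GroupName]; ok {
			// Keep the autoscaler's values for existing groups.
			goal.Spec.WorkerGroupSpecs[i].Replicas = liveGroup.Replicas
			goal.Spec.WorkerGroupSpecs[i].ScaleStrategy.WorkersToDelete = liveGroup.ScaleStrategy.WorkersToDelete
		}
	}
}
```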
…cluster (#1734)
Previously, when the RayCluster spec of a RayService was updated, one of two things would happen:
A new cluster would be rolled out via "zero-downtime-upgrade", or
In the case where only the Replicas and WorkersToDelete fields changed, nothing would happen. (This behavior was added by [Bug] RayService restarts repeatedly with Autoscaler #1037 to prevent the Autoscaler from inadvertently triggering rollouts when modifying these fields.)
This PR adds a third case: If WorkerGroupSpecs is modified in the following specific way and it doesn't fall into the case above, then the RayService controller will update the RayCluster instance in place without rolling out a new one.
Here is the specific way that triggers the third case:
The existing worker groups are not modified except for Replicas and WorkersToDelete, and one or more entries are appended to WorkerGroupSpecs (see the sketch below).
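For illustration, a rough sketch of a check for that condition, again assuming the rayv1 types; the helper shouldUpdateInPlace is hypothetical (the controller relies on spec hashes stored in annotations such as utils.HashWithoutReplicasAndWorkersToDeleteKey rather than a field-by-field diff), and it compares worker groups positionally since only appends qualify:

```go
package sketch

import (
	"reflect"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// shouldUpdateInPlace reports whether newSpec only appends worker groups and
// leaves the existing groups unchanged apart from the autoscaler-managed
// Replicas and WorkersToDelete fields. Hypothetical helper for illustration.
func shouldUpdateInPlace(oldSpec, newSpec rayv1.RayClusterSpec) bool {
	if len(newSpec.WorkerGroupSpecs) <= len(oldSpec.WorkerGroupSpecs) {
		return false // not a pure append; falls into one of the other two cases
	}
	for i := range oldSpec.WorkerGroupSpecs {
		// Compare copies with the autoscaler-managed fields cleared.
		oldGroup := oldSpec.WorkerGroupSpecs[i]
		newGroup := newSpec.WorkerGroupSpecs[i]
		oldGroup.Replicas, newGroup.Replicas = nil, nil
		oldGroup.ScaleStrategy.WorkersToDelete = nil
		newGroup.ScaleStrategy.WorkersToDelete = nil
		if !reflect.DeepEqual(oldGroup, newGroup) {
			return false
		}
	}
	return true
}
```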
In general, the updating happens in two places:
For the active RayCluster
For the pending RayCluster
Either of these two clusters may see an update to the spec, so we must handle both places.
In a followup, we may add the following optimization: if an existing worker group is modified and one or more entries are added to WorkerGroupSpecs, we should reject the spec. This will require an admission webhook or storing the previous spec somehow. (If we only store the hash, as we currently do, we cannot reconstruct the previous spec because all we have is the hash.)
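As a concrete illustration of the hash point, here is a sketch of how a hash that ignores Replicas and WorkersToDelete might be computed, under the assumption that it is a serialize-then-hash scheme; the actual utility behind utils.HashWithoutReplicasAndWorkersToDeleteKey may differ in detail:

```go
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// hashWithoutReplicasAndWorkersToDelete clears the autoscaler-managed fields
// on a deep copy of the spec and hashes its JSON serialization. Illustrative
// sketch only; the KubeRay utility may serialize or hash differently.
func hashWithoutReplicasAndWorkersToDelete(spec rayv1.RayClusterSpec) (string, error) {
	copied := spec.DeepCopy() // generated deep copy, avoids mutating the caller's spec
	for i := range copied.WorkerGroupSpecs {
		copied.WorkerGroupSpecs[i].Replicas = nil
		copied.WorkerGroupSpecs[i].ScaleStrategy.WorkersToDelete = nil
	}
	data, err := json.Marshal(copied)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}
```

Because the hash is one-way, storing it lets the controller detect that something other than Replicas and WorkersToDelete changed, but not which field changed, which is why rejecting modified-plus-appended specs needs an admission webhook or a stored copy of the previous spec.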
Other followup issues for this PR:
[RayService] Refactor to unify cluster decision for active and pending RayClusters #1761
[RayService] [CI] Some tests for pending/active clusters may spuriously pass because head pod is not manually set to ready #1768
[RayService] [Enhancement] Avoid unnecessary pod deletion when updating RayCluster #1769
---------
Signed-off-by: Archit Kulkarni <[email protected]>
Originally posted by @kevin85421 in #1734 (comment)