RayService supports updating the underlying RayCluster in place when new worker group specs are appended to WorkerGroupSpecs and the existing worker group specs are unchanged except for Replicas and WorkersToDelete, which may have been updated by the Autoscaler.
However, if the Autoscaler has updated the RayCluster since the spec was last reconciled, applying the update will write back the old values of Replicas and WorkersToDelete, which may kill some running Pods before the Autoscaler scales them up again.
One solution is to modify updateRayClusterInstance to keep the old values of Replicas and WorkersToDelete and only update the other fields. Some care may need to be taken to ensure rayClusterAnnotations[utils.HashWithoutReplicasAndWorkersToDeleteKey] is set correctly.
What will happen if there is inconsistency between the RayCluster and RayService's RayClusterSpec? For example, what if a worker group's replica count is set to 0 in RayClusterSpec, but the Ray Autoscaler has already scaled it up to 10? Updating the RayCluster may cause a lot of running Pods to be killed.
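Here is a minimal sketch of the proposed approach, assuming the rayv1 API types from the KubeRay operator; the helper name preserveAutoscalerFields and the surrounding wiring are illustrative only, not the actual updateRayClusterInstance implementation:

```go
package sketch

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// preserveAutoscalerFields copies Replicas and WorkersToDelete from the live
// (autoscaler-managed) RayCluster into the goal spec, so that an in-place
// update does not clobber scaling decisions made since the RayService spec
// was last reconciled. Newly appended worker groups have no live counterpart,
// so they keep the values from the RayService spec.
func preserveAutoscalerFields(goal, live *rayv1.RayCluster) {
	liveGroups := make(map[string]rayv1.WorkerGroupSpec, len(live.Spec.WorkerGroupSpecs))
	for _, g := range live.Spec.WorkerGroupSpecs {
		liveGroups[g.GroupName] = g
	}
	for i, g := range goal.Spec.WorkerGroupSpecs {
		if liveGroup, ok := liveGroups[g.GroupName]; ok {
			// Keep the autoscaler's values for existing groups.
			goal.Spec.WorkerGroupSpecs[i].Replicas = liveGroup.Replicas
			goal.Spec.WorkerGroupSpecs[i].ScaleStrategy.WorkersToDelete = liveGroup.ScaleStrategy.WorkersToDelete
		}
	}
}
```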
…cluster (#1734)
Previously, when the RayCluster spec of a RayService was updated, one of two things would happen:
A new cluster would be rolled out via "zero-downtime-upgrade", or
In the case where only the Replicas and WorkersToDelete fields changed, nothing would happen. (This behavior was added by [Bug] RayService restarts repeatedly with Autoscaler #1037 to prevent the Autoscaler from inadvertently triggering rollouts when modifying these fields.)
This PR adds a third case: If WorkerGroupSpecs is modified in the following specific way and it doesn't fall into the case above, then the RayService controller will update the RayCluster instance in place without rolling out a new one.
Here is the specific way that triggers the third case:
The existing worker groups are not modified except for Replicas and WorkersToDelete, and one or more entries are appended to WorkerGroupSpecs (see the sketch below).
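For illustration, a rough sketch of a check for that condition, again assuming the rayv1 types; the helper shouldUpdateInPlace is hypothetical (the controller relies on spec hashes stored in annotations such as utils.HashWithoutReplicasAndWorkersToDeleteKey rather than a field-by-field diff), and it compares worker groups positionally since only appends qualify:

```go
package sketch

import (
	"reflect"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// shouldUpdateInPlace reports whether newSpec only appends worker groups and
// leaves the existing groups unchanged apart from the autoscaler-managed
// Replicas and WorkersToDelete fields. Hypothetical helper for illustration.
func shouldUpdateInPlace(oldSpec, newSpec rayv1.RayClusterSpec) bool {
	if len(newSpec.WorkerGroupSpecs) <= len(oldSpec.WorkerGroupSpecs) {
		return false // not a pure append; falls into one of the other two cases
	}
	for i := range oldSpec.WorkerGroupSpecs {
		// Compare copies with the autoscaler-managed fields cleared.
		oldGroup := oldSpec.WorkerGroupSpecs[i]
		newGroup := newSpec.WorkerGroupSpecs[i]
		oldGroup.Replicas, newGroup.Replicas = nil, nil
		oldGroup.ScaleStrategy.WorkersToDelete = nil
		newGroup.ScaleStrategy.WorkersToDelete = nil
		if !reflect.DeepEqual(oldGroup, newGroup) {
			return false
		}
	}
	return true
}
```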
In general, the updating happens in two places:
For the active RayCluster
For the pending RayCluster
Either of these two clusters may see an update to the spec, so we must handle both places.
In a followup, we may add the following optimization: if an existing worker group is modified and one or more entries are added to WorkerGroupSpecs, we should reject the spec. This will require an admission webhook or storing the previous spec somehow. (If we only store the hash, as we currently do, we cannot reconstruct the previous spec because all we have is the hash.)
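As a concrete illustration of the hash point, here is a sketch of how a hash that ignores Replicas and WorkersToDelete might be computed, under the assumption that it is a serialize-then-hash scheme; the actual utility behind utils.HashWithoutReplicasAndWorkersToDeleteKey may differ in detail:

```go
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// hashWithoutReplicasAndWorkersToDelete clears the autoscaler-managed fields
// on a deep copy of the spec and hashes its JSON serialization. Illustrative
// sketch only; the KubeRay utility may serialize or hash differently.
func hashWithoutReplicasAndWorkersToDelete(spec rayv1.RayClusterSpec) (string, error) {
	copied := spec.DeepCopy() // generated deep copy, avoids mutating the caller's spec
	for i := range copied.WorkerGroupSpecs {
		copied.WorkerGroupSpecs[i].Replicas = nil
		copied.WorkerGroupSpecs[i].ScaleStrategy.WorkersToDelete = nil
	}
	data, err := json.Marshal(copied)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}
```

Because the hash is one-way, storing it lets the controller detect that something other than Replicas and WorkersToDelete changed, but not which field changed, which is why rejecting modified-plus-appended specs needs an admission webhook or a stored copy of the previous spec.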
Other followup issues for this PR:
[RayService] Refactor to unify cluster decision for active and pending RayClusters #1761
[RayService] [CI] Some tests for pending/active clusters may spuriously pass because head pod is not manually set to ready #1768
[RayService] [Enhancement] Avoid unnecessary pod deletion when updating RayCluster #1769
---------
Signed-off-by: Archit Kulkarni <[email protected]>
Originally posted by @kevin85421 in #1734 (comment)