# RayService high availability
RayService provides high availability (HA) to ensure services continue serving requests without failure during scaling up, scaling down, and upgrading the RayService configuration (zero-downtime upgrade).

## Quickstart

### Step 1: Create a Kubernetes cluster with Kind

```sh
kind create cluster --image=kindest/node:v1.24.0
```
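Optionally, confirm that `kubectl` now points at the new Kind cluster. `kind create cluster` names the cluster `kind` by default, so the kubeconfig context is `kind-kind`:
```sh
# Sanity check: the context created by the command above is named kind-kind by default.
kubectl cluster-info --context kind-kind
```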

### Step 2: Install the KubeRay operator
Follow the instructions in [this document](/helm-chart/kuberay-operator/README.md) to install the latest stable KubeRay operator, or follow the instructions in [DEVELOPMENT.md](/ray-operator/DEVELOPMENT.md) to install the nightly KubeRay operator.
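For reference, a typical stable installation with Helm looks like the sketch below; see the linked documents for the currently recommended chart version and for the nightly install steps.
```sh
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Pin --version to the release recommended in the KubeRay operator README.
helm install kuberay-operator kuberay/kuberay-operator
```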

### Step 3: Create a RayService and a locust cluster
```sh
# Path: kuberay/
kubectl apply -f ./ray-operator/config/samples/ray-service.high-availability-locust.yaml
kubectl get pod
# NAME                                                 READY   STATUS    RESTARTS   AGE
# kuberay-operator-64b4fc5946-zbfqd                    1/1     Running   0          72s
# locust-cluster-head-6clr5                            1/1     Running   0          38s
# rayservice-ha-raycluster-pfh8b-head-58xkr            2/2     Running   0          36s
```
The [ray-service.high-availability-locust.yaml](/ray-operator/config/samples/ray-service.high-availability-locust.yaml) file defines several Kubernetes objects (you can list them with the command below):
- A RayService with serve autoscaling and Pod autoscaling enabled.
- A RayCluster that functions as a locust cluster to simulate users sending requests.
- A configmap with a locustfile that defines the user request pattern: the request rate starts low, spikes, and then drops.
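If you want to see what the manifest created, a quick check with `kubectl` (the resource names assume the KubeRay CRDs installed in Step 2):
```sh
# List the RayService, the two RayClusters (the serve cluster and the locust cluster), and the locustfile configmap.
kubectl get rayservice,raycluster,configmap
```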

### Step 4: Use the locust cluster to simulate users sending requests
```sh
# Open a new terminal and log into the locust cluster.
kubectl exec -it $(kubectl get pods -o=name | grep locust-cluster-head) -- bash

# Install locust and download locust_runner.py.
# locust_runner.py helps distribute the locust workers across the RayCluster.
pip install locust && wget https://raw.githubusercontent.com/ray-project/serve_workloads/main/microbenchmarks/locust_runner.py

# Start sending requests to the RayService.
python locust_runner.py -f /locustfile/locustfile.py --host http://rayservice-ha-serve-svc:8000
```
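Before starting the load test, you can optionally confirm from your original terminal that the serve service targeted by `--host` exists:
```sh
# The locust workers reach the RayService through this Kubernetes service.
kubectl get svc rayservice-ha-serve-svc
```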

### Step 5: Verify high availability during scaling up and down

The locust cluster sends requests to the RayService at a rate that starts low, then spikes, and finally drops. This pattern triggers the RayService to scale up and down. You can verify high availability by watching the Ray Pods and the failure rate in the locust terminal.

```sh
watch -n 1 "kubectl get pod"
# Stage 1: Low request rate.
# NAME                                                 READY   STATUS     RESTARTS   AGE
# rayservice-ha-raycluster-pfh8b-head-58xkr            2/2     Running    0          78s
# rayservice-ha-raycluster-pfh8b-worker-worker-rd22n   0/1     Init:0/1   0          9s

# Stage 2: High request rate.
# rayservice-ha-raycluster-pfh8b-head-58xkr            2/2     Running    0          113s
# rayservice-ha-raycluster-pfh8b-worker-worker-7thjv   0/1     Init:0/1   0          4s
# rayservice-ha-raycluster-pfh8b-worker-worker-nt98j   0/1     Init:0/1   0          4s
# rayservice-ha-raycluster-pfh8b-worker-worker-rd22n   1/1     Running    0          44s

# Stage 3: Low request rate.
# NAME                                                 READY   STATUS        RESTARTS   AGE
# rayservice-ha-raycluster-pfh8b-head-58xkr            2/2     Running       0          3m38s
# rayservice-ha-raycluster-pfh8b-worker-worker-7thjv   0/1     Terminating   0          109s
# rayservice-ha-raycluster-pfh8b-worker-worker-nt98j   0/1     Terminating   0          109s
# rayservice-ha-raycluster-pfh8b-worker-worker-rd22n   1/1     Running       0          2m29s
```
Let's walk through how KubeRay and Ray ensure high availability during scaling, using the example above.

In this example, the RayService is configured as follows (you can inspect the exact values with the command after this list):
- Every node can have at most one serve replica.
- The initial number of serve replicas is set to zero.
- Following best practices, no workloads are scheduled on the head node.
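These settings live in the serve config embedded in the RayService spec. A hedged way to print them, assuming the sample uses the `serveConfigV2` field of the RayService CRD:
```sh
# Print the embedded Ray Serve config (replica counts, max replicas per node, autoscaling settings).
kubectl get rayservice rayservice-ha -o jsonpath='{.spec.serveConfigV2}'
```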

With the above settings, when serve replicas scale up:
1. KubeRay creates a new worker Pod. Since no serve replica is running on the new Pod yet, the readiness probe for the new Pod fails. As a result, the endpoint is not added to the serve service. (You can inspect the injected readiness probe with the command after this list.)
2. Ray then schedules a new serve replica onto the newly created worker Pod. Once the serve replica is running, the readiness probe passes and the endpoint is added to the serve service.
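To see the readiness probe that gates the serve service endpoints, you can dump it from the worker Pod spec. This sketch assumes KubeRay's `ray.io/node-type=worker` Pod label and that the Ray container is the first container in the Pod:
```sh
# Print each worker Pod's name and the readiness probe of its Ray container.
kubectl get pod -l ray.io/node-type=worker \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[0].readinessProbe}{"\n"}{end}'
```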

When serve replicas scale down:
1. The proxy actor in the worker Pod that is scaling down changes its state to `draining`. The readiness probe starts failing immediately, and the endpoint begins to be removed from the serve service. However, this removal takes some time, so incoming requests are still redirected to this worker Pod for a short period.
2. While draining, the proxy actor can still redirect incoming requests. The proxy actor only changes to the `drained` state and is removed when the following conditions are met:
   - There are no ongoing requests.
   - The minimum draining time has been reached, which is controlled by the environment variable `RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S`.

   Also, removing endpoints from the serve service does not affect existing ongoing requests. Together, these behaviors ensure high availability.
3. Once the worker Pod becomes idle, KubeRay removes it from the cluster.

   > Note: the default value of `RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S` is 30s. You can change it to fit your Kubernetes cluster.

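To watch this draining behavior from Ray's point of view, you can query Serve from the head Pod. A sketch, assuming the head Pod name matches the `rayservice-ha-raycluster-*-head-*` pattern shown above; a draining proxy is reported with a DRAINING status:
```sh
# Run the Ray Serve CLI inside the RayService head Pod to show proxy and application statuses.
kubectl exec -it $(kubectl get pods -o=name | grep rayservice-ha-raycluster | grep head) -- serve status
```
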
### Step 6: Verify high availability during upgrade
The locust cluster keeps sending requests for 600 seconds. Before the 600 seconds are up, upgrade the RayService configuration by adding a new environment variable. This triggers a zero-downtime rolling update. You can verify high availability by watching the Ray Pods and the failure rate in the locust terminal.
```sh
kubectl patch rayservice rayservice-ha --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/rayClusterConfig/headGroupSpec/template/spec/containers/0/env",
    "value": [
      {
        "name": "RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S",
        "value": "30"
      }
    ]
  }
]'

watch -n 1 "kubectl get pod"
# Stage 1: New head pod is created.
# NAME                                                 READY   STATUS    RESTARTS   AGE
# rayservice-ha-raycluster-nhs7v-head-z6xkn            1/2     Running   0          4s
# rayservice-ha-raycluster-pfh8b-head-58xkr            2/2     Running   0          4m30s
# rayservice-ha-raycluster-pfh8b-worker-worker-rd22n   1/1     Running   0          3m21s

# Stage 2: Old head pod terminates after the new head pod is ready and the k8s service is fully updated.
# NAME                                                 READY   STATUS        RESTARTS   AGE
# rayservice-ha-raycluster-nhs7v-head-z6xkn            2/2     Running       0          91s
# rayservice-ha-raycluster-nhs7v-worker-worker-jplrp   0/1     Init:0/1      0          3s
# rayservice-ha-raycluster-pfh8b-head-58xkr            2/2     Terminating   0          5m57s
# rayservice-ha-raycluster-pfh8b-worker-worker-rd22n   1/1     Terminating   0          4m48s
```
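While the rolling update is in progress, you can also follow which RayCluster the RayService considers active versus pending. This is a sketch that assumes the `activeServiceStatus` and `pendingServiceStatus` fields of the RayService status in current KubeRay releases:
```sh
# Print the active and pending RayCluster names tracked in the RayService status.
kubectl get rayservice rayservice-ha \
  -o jsonpath='{.status.activeServiceStatus.rayClusterName}{" (active)  "}{.status.pendingServiceStatus.rayClusterName}{" (pending)"}{"\n"}'
```
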
When a new configuration is applied, the KubeRay operator always creates a new RayCluster with the new configuration and then removes the old RayCluster.
Here are the details of the rolling update:
1. KubeRay creates a new RayCluster with the new configuration. At this point, all requests are still served by the old RayCluster.
2. After the new RayCluster and the serve applications on it are ready, KubeRay updates the serve service to redirect traffic to the new RayCluster. Because updating the k8s service takes some time, traffic is briefly served by both the old and the new RayCluster. (You can observe the switch with the command after this list.)
3. After the serve service is fully updated, KubeRay removes the old RayCluster. Traffic is now served entirely by the new RayCluster.
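One way to observe the switch in steps 2 and 3 is to watch the selector of the serve service move from the old RayCluster to the new one. The exact selector labels depend on the KubeRay version, so treat this as an illustrative check:
```sh
# The selector should reference the new RayCluster name once the switch completes.
kubectl get svc rayservice-ha-serve-svc -o jsonpath='{.spec.selector}{"\n"}'
```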

### Step 7: Examine the locust results
In your locust terminal, you should see a failure rate of 0.00%:
```sh
 # fails      |
|-------------|
 0(0.00%)     |
|-------------|
 0(0.00%)     |
```

### Step 8: Clean up
```sh
kubectl delete -f ./ray-operator/config/samples/ray-service.high-availability-locust.yaml
kind delete cluster
```