## TorchServe on Azure Kubernetes Service (AKS)

### 1 Create an AKS cluster

This quickstart requires Azure CLI version 2.0.64 or later. Run `az --version` to find the version. If you need to install or upgrade, see [Install Azure CLI](https://docs.microsoft.com/cli/azure/install-azure-cli).

#### 1.1 Set Azure account information

```az login```

```az account set -s your-subscription-ID```

#### 1.2 Create a resource group

An Azure resource group is a logical group in which Azure resources are deployed and managed. When you create a resource group, you are asked to specify a location. This location is where resource group metadata is stored; it is also where your resources run in Azure if you don't specify another region during resource creation. Create a resource group using the [az group create](https://docs.microsoft.com/en-us/cli/azure/group#az-group-create) command.

The following example creates a resource group named *myResourceGroup* in the *eastus* location.

```az group create --name myResourceGroup --location eastus```

#### 1.3 Create an AKS cluster

Use the [az aks create](https://docs.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az-aks-create) command to create an AKS cluster. The following example creates a cluster named *myAKSCluster* with one GPU node (`Standard_NC6`). This will take several minutes to complete.

```az aks create --resource-group myResourceGroup --name myAKSCluster --node-vm-size Standard_NC6 --node-count 1 --generate-ssh-keys```
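
If you instead start the cluster with `--no-wait`, you can poll the provisioning state yourself. A minimal sketch, assuming a hypothetical `wait_for_state` helper and that `az aks show --query provisioningState -o tsv` prints `Succeeded` once the cluster is ready:

```shell
# Hypothetical helper: run the given command repeatedly until it prints
# the wanted state, or give up after 60 attempts.
wait_for_state() {
  want="$1"; shift
  i=0
  while [ "$i" -lt 60 ]; do
    state="$("$@" 2>/dev/null)"
    if [ "$state" = "$want" ]; then
      return 0
    fi
    sleep 1            # real use: a longer interval, e.g. sleep 30
    i=$((i + 1))
  done
  return 1
}

# Real use (assumption, matching the names in this guide):
# wait_for_state Succeeded az aks show --resource-group myResourceGroup \
#   --name myAKSCluster --query provisioningState -o tsv
```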

#### 1.4 Connect to the cluster

To manage a Kubernetes cluster, you use [kubectl](https://kubernetes.io/docs/user-guide/kubectl/), the Kubernetes command-line client. If you use Azure Cloud Shell, `kubectl` is already installed. To install `kubectl` locally, use the [az aks install-cli](https://docs.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az-aks-install-cli) command:

```az aks install-cli```

To configure `kubectl` to connect to your Kubernetes cluster, use the [az aks get-credentials](https://docs.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az-aks-get-credentials) command. This command downloads credentials and configures the Kubernetes CLI to use them.

```az aks get-credentials --resource-group myResourceGroup --name myAKSCluster```

#### 1.5 Install Helm

```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
```

### 2 Deploy TorchServe on AKS

#### 2.1 Clone the GitHub repository and enter the kubernetes directory

```git clone https://github.com/pytorch/serve.git```

```cd serve/kubernetes/AKS```

#### 2.2 Install the NVIDIA device plugin

Before the GPUs in the nodes can be used, you must deploy a DaemonSet for the NVIDIA device plugin. This DaemonSet runs a pod on each node to provide the required drivers for the GPUs.

```kubectl apply -f templates/nvidia-device-plugin-ds.yaml```

`kubectl get pods` should show something similar to:

```bash
NAME                                   READY   STATUS    RESTARTS   AGE
nvidia-device-plugin-daemonset-7lvxd   1/1     Running   0          42s
```
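
Rather than eyeballing the table, the check can be scripted. A minimal sketch, with a hypothetical helper that scans `kubectl get pods` output for a matching pod in the `Running` state:

```shell
# Hypothetical helper: reads "kubectl get pods" output on stdin and
# succeeds only if a pod whose name matches $1 is in the Running state.
check_running() {
  awk -v pat="$1" '$1 ~ pat && $3 == "Running" { found = 1 } END { exit !found }'
}

# Real use (assumption): kubectl get pods | check_running nvidia-device-plugin
```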

#### 2.3 Create a storage class

A storage class is used to define how an Azure file share is created. If multiple pods need concurrent access to the same storage volume, you need Azure Files. Create the storage class with the following kubectl apply command:

```kubectl apply -f templates/Azure_file_sc.yaml```

#### 2.4 Create a PersistentVolume

```kubectl apply -f templates/AKS_pv_claim.yaml```

Your output should look similar to

```persistentvolumeclaim/model-store-claim created```

Verify that the PVC / PV has been created by executing

```kubectl get pvc,pv```

Your output should look similar to

```bash
NAME                                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
persistentvolumeclaim/model-store-claim   Bound    pvc-c9e235a8-ca2b-4d04-8f25-8262de1bb915   1Gi        RWO            managed-premium   29s

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                       STORAGECLASS      REASON   AGE
persistentvolume/pvc-c9e235a8-ca2b-4d04-8f25-8262de1bb915   1Gi        RWO            Delete           Bound    default/model-store-claim   managed-premium            28s
```
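
For scripting, the claim's phase can be read directly with a JSONPath query instead of parsing the table. A minimal sketch; the phase is hard-coded here for illustration, with the real query shown in a comment:

```shell
# Real use (assumption): phase=$(kubectl get pvc model-store-claim -o jsonpath='{.status.phase}')
phase="Bound"   # hard-coded for illustration

if [ "$phase" = "Bound" ]; then
  echo "PVC ready"
else
  echo "PVC not ready: $phase" >&2
fi
```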

#### 2.5 Create a pod and copy MAR / config files

Create a pod named `model-store-pod` with the PersistentVolume mounted so that we can copy the MAR / config files.

```kubectl apply -f templates/model_store_pod.yaml```

Your output should look similar to

```pod/model-store-pod created```

Verify that the pod is created by executing

```kubectl get po```

Your output should look similar to

```bash
NAME                                   READY   STATUS    RESTARTS   AGE
model-store-pod                        1/1     Running   0          143m
nvidia-device-plugin-daemonset-qccgn   1/1     Running   0          144m
```

#### 2.6 Download and copy MAR / config files

```bash
wget https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar
wget https://torchserve.pytorch.org/mar_files/mnist.mar

kubectl exec --tty pod/model-store-pod -- mkdir /mnt/azure/model-store/
kubectl cp squeezenet1_1.mar model-store-pod:/mnt/azure/model-store/squeezenet1_1.mar
kubectl cp mnist.mar model-store-pod:/mnt/azure/model-store/mnist.mar

kubectl exec --tty pod/model-store-pod -- mkdir /mnt/azure/config/
kubectl cp config.properties model-store-pod:/mnt/azure/config/config.properties
```

Verify that the MAR / config files have been copied to the directory.

```kubectl exec --tty pod/model-store-pod -- find /mnt/azure/```

Your output should look similar to

```bash
/mnt/azure/
/mnt/azure/config
/mnt/azure/config/config.properties
/mnt/azure/lost+found
/mnt/azure/model-store
/mnt/azure/model-store/mnist.mar
/mnt/azure/model-store/squeezenet1_1.mar
```
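
A scripted version of that verification might compare the listing against the files you expect. A minimal sketch; the listing below is simulated, with the real `kubectl exec` call shown in a comment:

```shell
# Real use (assumption):
# listing=$(kubectl exec --tty pod/model-store-pod -- find /mnt/azure/)
listing="/mnt/azure/config/config.properties
/mnt/azure/model-store/mnist.mar
/mnt/azure/model-store/squeezenet1_1.mar"

# Report any expected file that is absent from the listing.
for f in config.properties mnist.mar squeezenet1_1.mar; do
  case "$listing" in
    *"$f"*) echo "found: $f" ;;
    *)      echo "missing: $f" >&2 ;;
  esac
done
```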

#### 2.7 Install TorchServe using Helm charts

Enter the Helm directory and install TorchServe using the Helm chart.

```cd ../Helm```

```helm install ts .```

Your output should look similar to

```bash
NAME: ts
LAST DEPLOYED: Thu Aug 20 02:07:38 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
```

#### 2.8 Check the status of TorchServe

```kubectl get po```

The installation will take a few minutes. Output like this means the installation is not completed yet.

```bash
NAME                          READY   STATUS              RESTARTS   AGE
torchserve-75f5b67469-5hnsn   0/1     ContainerCreating   0          6s
```

Output like this means the installation is completed.

```bash
NAME                          READY   STATUS    RESTARTS   AGE
torchserve-75f5b67469-5hnsn   1/1     Running   0          2m36s
```

### 3 Test TorchServe Installation

#### 3.1 Fetch the Load Balancer External IP

Fetch the Load Balancer External IP by executing

```kubectl get svc```

Your output should look similar to

```bash
NAME         TYPE           CLUSTER-IP   EXTERNAL-IP        PORT(S)                         AGE
kubernetes   ClusterIP      10.0.0.1     <none>             443/TCP                         5d19h
torchserve   LoadBalancer   10.0.39.88   your-external-IP   8080:30306/TCP,8081:30442/TCP   48s
```
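
If you want the address in a variable for the curl calls that follow, it can be pulled out of that listing. A minimal sketch, with a hypothetical helper that prints the EXTERNAL-IP column of the `torchserve` row:

```shell
# Hypothetical helper: reads "kubectl get svc" output on stdin and prints
# the EXTERNAL-IP (4th column) of the torchserve service.
torchserve_ip() {
  awk '$1 == "torchserve" { print $4 }'
}

# Real use (assumption): EXTERNAL_IP=$(kubectl get svc | torchserve_ip)
```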

#### 3.2 Test Management / Prediction APIs

```
curl http://your-external-IP:8081/models
```

Your output should look similar to

```
{
  "models": [
    {
      "modelName": "mnist",
      "modelUrl": "mnist.mar"
    },
    {
      "modelName": "squeezenet1_1",
      "modelUrl": "squeezenet1_1.mar"
    }
  ]
}
```
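
To pull just the model names out of that response in a script, a grep-based sketch (no `jq` required). The response is inlined here for illustration; the real call is the same management-API curl shown above:

```shell
# Real use (assumption): resp=$(curl -s http://your-external-IP:8081/models)
resp='{"models": [{"modelName": "mnist", "modelUrl": "mnist.mar"}, {"modelName": "squeezenet1_1", "modelUrl": "squeezenet1_1.mar"}]}'

# Extract the value of every "modelName" field, one per line.
printf '%s\n' "$resp" | grep -o '"modelName": "[^"]*"' | cut -d'"' -f4
```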

```
curl http://your-external-IP:8081/models/mnist
```

Your output should look similar to

```
[
  {
    "modelName": "mnist",
    "modelVersion": "1.0",
    "modelUrl": "mnist.mar",
    "runtime": "python",
    "minWorkers": 5,
    "maxWorkers": 5,
    "batchSize": 1,
    "maxBatchDelay": 200,
    "loadedAtStartup": false,
    "workers": [
      {
        "id": "9003",
        "startTime": "2020-08-20T03:06:38.435Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 32194560
      },
      {
        "id": "9004",
        "startTime": "2020-08-20T03:06:38.436Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 31842304
      },
      {
        "id": "9005",
        "startTime": "2020-08-20T03:06:38.436Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 44621824
      },
      {
        "id": "9006",
        "startTime": "2020-08-20T03:06:38.436Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 42045440
      },
      {
        "id": "9007",
        "startTime": "2020-08-20T03:06:38.436Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 31584256
      }
    ]
  }
]
```
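
The management API above runs on port 8081; inference requests go to port 8080 at `/predictions/<model-name>`. A hedged sketch of building the inference URL; the input file `0.png` (a 28x28 grayscale MNIST digit image) is an assumption you must supply yourself:

```shell
EXTERNAL_IP="your-external-IP"   # substitute the EXTERNAL-IP from `kubectl get svc`
URL="http://${EXTERNAL_IP}:8080/predictions/mnist"
echo "$URL"

# Real use (assumption: 0.png is a digit image you provide):
# curl -s "$URL" -T 0.png
```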

### 4 Delete the cluster

To avoid Azure charges, you should clean up unneeded resources. When the AKS cluster is no longer needed, use the az aks delete command to remove it.

```
az aks delete --name myAKSCluster --resource-group myResourceGroup --yes --no-wait
```

Or, if the resource group is no longer needed, use the az group delete command to remove the resource group and all related resources.

```
az group delete --name myResourceGroup --yes --no-wait
```