
Commit 6054cd0

Authored by MengMeng96, maaquib, jeremiahschung, and gunandrose4u
TorchServe on AKS (#644)
* use UUID as the name of model directory
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit
* fit:
* Name the model file with UUID
* Name model files whith UUID
* fix
* remove .idea files
* delete .idea files
* delete .gitignore
* fit
* fit
* TorchServe on AKS
* fit
* fit
* fit
* fit
* fit
* add document
* add TorchServe-GPU
* Share Helm Chart between EKS & AKS
* add Azure File as PV
* fit
* delete .DS_store
* Update kubernetes/AKS/README.md — Remove reference to locale-specific documentation
* EN-US
* Update kubernetes/EKS/README.md (Co-authored-by: jeremiahschung <[email protected]>)
* Update kubernetes/EKS/README.md (Co-authored-by: jeremiahschung <[email protected]>)
* Update kubernetes/EKS/README.md (Co-authored-by: jeremiahschung <[email protected]>)
* merge
* Update kubernetes/AKS/README.md (Co-authored-by: Aaqib <[email protected]>)
* Update kubernetes/AKS/README.md (Co-authored-by: jeremiahschung <[email protected]>)
* change the order of AKS/README.md
* change azurefile name
* Update kubernetes/AKS/README.md (Co-authored-by: jeremiahschung <[email protected]>)
* Update kubernetes/AKS/README.md (Co-authored-by: jeremiahschung <[email protected]>)
* Update kubernetes/AKS/README.md (Co-authored-by: jeremiahschung <[email protected]>)
* Update kubernetes/AKS/README.md (Co-authored-by: jeremiahschung <[email protected]>)
* Update kubernetes/AKS/README.md (Co-authored-by: jeremiahschung <[email protected]>)
* Update kubernetes/AKS/README.md (Co-authored-by: jeremiahschung <[email protected]>)
* add result to bash block
* add result to bash block
* Update with section to remove aks cluster and resource group
* Remove dup archiveTest() test case

Co-authored-by: MengMeng96 <[email protected]>
Co-authored-by: Aaqib <[email protected]>
Co-authored-by: jeremiahschung <[email protected]>
Co-authored-by: Joe Zhu <[email protected]>
Co-authored-by: gunandrose4u <[email protected]>
Co-authored-by: Geeta Chauhan <[email protected]>
1 parent 7cecea2 commit 6054cd0

20 files changed: +1204, -1 lines changed

frontend/modelarchive/src/main/java/org/pytorch/serve/archive/ModelArchive.java

-1
```diff
@@ -56,7 +56,6 @@ public static ModelArchive downloadModel(
         String marFileName = FilenameUtils.getName(url);
         File modelLocation = new File(modelStore, marFileName);
-
         if (checkAllowedUrl(allowedUrls, url)) {
             if (modelLocation.exists()) {
                 throw new FileAlreadyExistsException(marFileName);
```

kubernetes/AKS/README.md

+291
## TorchServe on Azure Kubernetes Service (AKS)

### 1 Create an AKS cluster

This quickstart requires that you are running the Azure CLI version 2.0.64 or later. Run `az --version` to find the version. If you need to install or upgrade, see [Install Azure CLI](https://docs.microsoft.com/cli/azure/install-azure-cli).
#### 1.1 Set Azure account information

```az login```

```az account set -s your-subscription-ID```
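If you are unsure which subscription ID to pass to `az account set`, you can first list the subscriptions available to your account. A minimal sketch using the standard Azure CLI query:

```bash
# List the subscriptions for the signed-in account, including their IDs
az account list --output table
```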
#### 1.2 Create a resource group

An Azure resource group is a logical group in which Azure resources are deployed and managed. When you create a resource group, you are asked to specify a location. This location is where resource group metadata is stored; it is also where your resources run in Azure if you don't specify another region during resource creation. Create a resource group using the [az group create](https://docs.microsoft.com/en-us/cli/azure/group#az-group-create) command.

The following example creates a resource group named *myResourceGroup* in the *eastus* location.

```az group create --name myResourceGroup --location eastus```
#### 1.3 Create an AKS cluster

Use the [az aks create](https://docs.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az-aks-create) command to create an AKS cluster. The following example creates a cluster named *myAKSCluster* with one GPU node (VM size *Standard_NC6*). This will take several minutes to complete.

```az aks create --resource-group myResourceGroup --name myAKSCluster --node-vm-size Standard_NC6 --node-count 1 --generate-ssh-keys```
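To confirm that the cluster provisioned successfully, you can query its state with the standard `az aks show` command; it prints `Succeeded` once the cluster is ready:

```bash
# Query the provisioning state of the new cluster
az aks show --resource-group myResourceGroup --name myAKSCluster \
    --query provisioningState --output tsv
```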
#### 1.4 Connect to the cluster

To manage a Kubernetes cluster, you use [kubectl](https://kubernetes.io/docs/user-guide/kubectl/), the Kubernetes command-line client. If you use Azure Cloud Shell, `kubectl` is already installed. To install `kubectl` locally, use the [az aks install-cli](https://docs.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az-aks-install-cli) command:

```az aks install-cli```

To configure `kubectl` to connect to your Kubernetes cluster, use the [az aks get-credentials](https://docs.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az-aks-get-credentials) command. This command downloads credentials and configures the Kubernetes CLI to use them.

```az aks get-credentials --resource-group myResourceGroup --name myAKSCluster```
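To verify the connection to your cluster, list the cluster nodes; the single *Standard_NC6* node created above should report a `Ready` status:

```bash
# List the nodes in the cluster and their status
kubectl get nodes
```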
#### 1.5 Install Helm

```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
```
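Verify that the Helm 3 client installed correctly (the exact version string will differ):

```bash
# Print the installed Helm client version
helm version --short
```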
### 2 Deploy TorchServe on AKS

#### 2.1 Download the GitHub repository and enter the kubernetes directory

```git clone https://github.com/pytorch/serve.git```

```cd serve/kubernetes/AKS```
#### 2.2 Install NVIDIA device plugin

Before the GPUs in the nodes can be used, you must deploy a DaemonSet for the NVIDIA device plugin. This DaemonSet runs a pod on each node to provide the required drivers for the GPUs.

```kubectl apply -f templates/nvidia-device-plugin-ds.yaml```

`kubectl get pods` should show something similar to:

```bash
NAME                                   READY   STATUS    RESTARTS   AGE
nvidia-device-plugin-daemonset-7lvxd   1/1     Running   0          42s
```
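Once the plugin pod is running, you can also confirm that the GPU is advertised to the Kubernetes scheduler. A quick check (the exact `describe` output varies by node):

```bash
# The node should report nvidia.com/gpu under Capacity and Allocatable
kubectl describe node | grep -i "nvidia.com/gpu"
```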
#### 2.3 Create a storage class

A storage class is used to define how an Azure file share is created. If multiple pods need concurrent access to the same storage volume, you need Azure Files. Create the storage class with the following `kubectl apply` command:

```kubectl apply -f templates/Azure_file_sc.yaml```
#### 2.4 Create a PersistentVolumeClaim

```kubectl apply -f templates/AKS_pv_claim.yaml```

Your output should look similar to

```persistentvolumeclaim/model-store-claim created```

Verify that the PVC / PV is created by executing

```kubectl get pvc,pv```

Your output should look similar to

```bash
NAME                                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
persistentvolumeclaim/model-store-claim   Bound    pvc-c9e235a8-ca2b-4d04-8f25-8262de1bb915   1Gi        RWO            managed-premium   29s

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                       STORAGECLASS      REASON   AGE
persistentvolume/pvc-c9e235a8-ca2b-4d04-8f25-8262de1bb915   1Gi        RWO            Delete           Bound    default/model-store-claim   managed-premium   28s
```
#### 2.5 Create a pod and copy MAR / config files

Create a pod named `model-store-pod` with the PersistentVolume mounted so that we can copy the MAR / config files.

```kubectl apply -f templates/model_store_pod.yaml```

Your output should look similar to

```pod/model-store-pod created```

Verify that the pod is created by executing

```kubectl get po```

Your output should look similar to

```bash
NAME                                   READY   STATUS    RESTARTS   AGE
model-store-pod                        1/1     Running   0          143m
nvidia-device-plugin-daemonset-qccgn   1/1     Running   0          144m
```
#### 2.6 Download and copy MAR / config files

```bash
wget https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar
wget https://torchserve.pytorch.org/mar_files/mnist.mar

kubectl exec --tty pod/model-store-pod -- mkdir /mnt/azure/model-store/
kubectl cp squeezenet1_1.mar model-store-pod:/mnt/azure/model-store/squeezenet1_1.mar
kubectl cp mnist.mar model-store-pod:/mnt/azure/model-store/mnist.mar

kubectl exec --tty pod/model-store-pod -- mkdir /mnt/azure/config/
kubectl cp config.properties model-store-pod:/mnt/azure/config/config.properties
```

Verify that the MAR / config files have been copied to the directory.

```kubectl exec --tty pod/model-store-pod -- find /mnt/azure/```

Your output should look similar to

```bash
/mnt/azure/
/mnt/azure/config
/mnt/azure/config/config.properties
/mnt/azure/lost+found
/mnt/azure/model-store
/mnt/azure/model-store/mnist.mar
/mnt/azure/model-store/squeezenet1_1.mar
```
#### 2.7 Install TorchServe using Helm charts

Enter the Helm directory and install TorchServe using the Helm chart.

```cd ../Helm```

```helm install ts .```

Your output should look similar to

```bash
NAME: ts
LAST DEPLOYED: Thu Aug 20 02:07:38 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
```
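You can also confirm that the release was registered with Helm (standard command; the listing should show the `ts` release with status `deployed`):

```bash
# List Helm releases in the current namespace
helm list
```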
#### 2.8 Check the status of TorchServe

```kubectl get po```

The installation will take a few minutes. Output like this means the installation is not complete yet:

```bash
NAME                          READY   STATUS              RESTARTS   AGE
torchserve-75f5b67469-5hnsn   0/1     ContainerCreating   0          6s
```

Output like this means the installation is complete:

```bash
NAME                          READY   STATUS    RESTARTS   AGE
torchserve-75f5b67469-5hnsn   1/1     Running   0          2m36s
```
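If the pod stays in `ContainerCreating` for much longer than a few minutes, you can inspect its events to see what it is waiting on (a generic debugging step; substitute your own pod name from `kubectl get po`):

```bash
# Show scheduling, volume-mount, and image-pull events for the TorchServe pod
kubectl describe po torchserve-75f5b67469-5hnsn
```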
### 3 Test the TorchServe installation

#### 3.1 Fetch the Load Balancer external IP

Fetch the Load Balancer external IP by executing

```kubectl get svc```

Your output should look similar to

```bash
NAME         TYPE           CLUSTER-IP   EXTERNAL-IP        PORT(S)                         AGE
kubernetes   ClusterIP      10.0.0.1     <none>             443/TCP                         5d19h
torchserve   LoadBalancer   10.0.39.88   your-external-IP   8080:30306/TCP,8081:30442/TCP   48s
```
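For the `curl` commands below, it can be convenient to capture the external IP in a shell variable instead of copying it by hand. A small sketch using a standard `kubectl` JSONPath query (it assumes the load balancer exposes an IP rather than a hostname):

```bash
# Store the torchserve service's external IP for reuse in later requests
EXTERNAL_IP=$(kubectl get svc torchserve -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "${EXTERNAL_IP}"
```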
#### 3.2 Test Management / Prediction APIs

```
curl http://your-external-IP:8081/models
```

Your output should look similar to

```
{
  "models": [
    {
      "modelName": "mnist",
      "modelUrl": "mnist.mar"
    },
    {
      "modelName": "squeezenet1_1",
      "modelUrl": "squeezenet1_1.mar"
    }
  ]
}
```
```
curl http://your-external-IP:8081/models/mnist
```

Your output should look similar to

```
[
  {
    "modelName": "mnist",
    "modelVersion": "1.0",
    "modelUrl": "mnist.mar",
    "runtime": "python",
    "minWorkers": 5,
    "maxWorkers": 5,
    "batchSize": 1,
    "maxBatchDelay": 200,
    "loadedAtStartup": false,
    "workers": [
      {
        "id": "9003",
        "startTime": "2020-08-20T03:06:38.435Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 32194560
      },
      {
        "id": "9004",
        "startTime": "2020-08-20T03:06:38.436Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 31842304
      },
      {
        "id": "9005",
        "startTime": "2020-08-20T03:06:38.436Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 44621824
      },
      {
        "id": "9006",
        "startTime": "2020-08-20T03:06:38.436Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 42045440
      },
      {
        "id": "9007",
        "startTime": "2020-08-20T03:06:38.436Z",
        "status": "READY",
        "gpu": false,
        "memoryUsage": 31584256
      }
    ]
  }
]
```
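The calls above exercise the management API on port 8081. To exercise the prediction API on port 8080 as well, you can send the `mnist` model an image of a handwritten digit. A hedged example, assuming the sample image that ships in the `pytorch/serve` repository's MNIST example (the handler should respond with the predicted digit):

```bash
# Fetch a sample handwritten-digit image from the pytorch/serve examples
curl -O https://raw.githubusercontent.com/pytorch/serve/master/examples/image_classifier/mnist/test_data/0.png

# Send it to the inference API on port 8080
curl http://your-external-IP:8080/predictions/mnist -T 0.png
```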
### 4 Delete the cluster

To avoid Azure charges, you should clean up unneeded resources. When the AKS cluster is no longer needed, use the az aks delete command to remove it.

```
az aks delete --name myAKSCluster --resource-group myResourceGroup --yes --no-wait
```

If the resource group is no longer needed, use the az group delete command to remove the resource group and all related resources.

```
az group delete --name myResourceGroup --yes --no-wait
```
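Both commands return immediately because of `--no-wait`. If you want to confirm that cleanup actually finished, a standard check is the following; it prints `false` once the resource group is fully deleted:

```bash
# Returns "false" once the resource group and its resources are gone
az group exists --name myResourceGroup
```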
File renamed without changes.
kubernetes/AKS/templates/AKS_pv_claim.yaml

+11

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: model-store-claim
spec:
  storageClassName: persistent-volume-azurefile
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```
kubernetes/AKS/templates/Azure_file_sc.yaml

+14

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: persistent-volume-azurefile
provisioner: kubernetes.io/azure-file
mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=0
  - gid=0
  - mfsymlinks
  - cache=strict
parameters:
  skuName: Standard_LRS
```
kubernetes/AKS/templates/model_store_pod.yaml

+24

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-store-pod
spec:
  volumes:
    - name: model-store
      persistentVolumeClaim:
        claimName: model-store-claim
  containers:
    - name: model-store
      image: ubuntu
      command: [ "sleep" ]
      args: [ "infinity" ]
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 250m
          memory: 256Mi
      volumeMounts:
        - mountPath: "/mnt/azure"
          name: model-store
```
kubernetes/AKS/templates/nvidia-device-plugin-ds.yaml

+42

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure. This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
        # This, along with the annotation above, marks this pod as a critical add-on.
        - key: CriticalAddonsOnly
          operator: Exists
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - image: nvidia/k8s-device-plugin:1.11
          name: nvidia-device-plugin-ctr
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```
