@@ -14,9 +14,46 @@ Experimental InstaSlice works with GPU operator to create mig slices on demand.

### Install KinD cluster with GPU operator

+ - Make sure MIG mode is enabled on the host GPUs; `nvidia-smi` should report `Enabled` under `MIG M.`
+
+ ```sh
+ +-----------------------------------------------------------------------------------------+
+ | NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
+ |-----------------------------------------+------------------------+----------------------+
+ | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+ | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
+ |                                         |                        |               MIG M. |
+ |=========================================+========================+======================|
+ |   0  NVIDIA A100-PCIE-40GB          Off |   00000000:0E:00.0 Off |                   On |
+ | N/A   36C    P0             33W /  250W |       0MiB /  40960MiB |      N/A     Default |
+ |                                         |                        |              Enabled |
+ +-----------------------------------------+------------------------+----------------------+
+ |   1  NVIDIA A100-PCIE-40GB          Off |   00000000:0F:00.0 Off |                   On |
+ | N/A   40C    P0             32W /  250W |       0MiB /  40960MiB |      N/A     Default |
+ |                                         |                        |              Enabled |
+ +-----------------------------------------+------------------------+----------------------+
+
+ +-----------------------------------------------------------------------------------------+
+ | MIG devices:                                                                            |
+ +------------------+----------------------------------+-----------+-----------------------+
+ | GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
+ |      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
+ |                  |                                  |        ECC|                       |
+ |==================+==================================+===========+=======================|
+ |  No MIG devices found                                                                   |
+ +-----------------------------------------------------------------------------------------+
+
+ +-----------------------------------------------------------------------------------------+
+ | Processes:                                                                              |
+ |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
+ |        ID   ID                                                               Usage      |
+ |=========================================================================================|
+ |  No running processes found                                                             |
+ ```
+
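+ If MIG mode is still disabled, it can be switched on per GPU from the host (a sketch; the index `0` is an example, and a GPU reset or host reboot may be needed before the change takes effect):
+
+ ```sh
+ # Enable MIG mode on GPU 0; repeat for each GPU index as needed
+ sudo nvidia-smi -i 0 -mig 1
+ ```
+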
- Run the script below

```sh
- ./setup.sh
+ sh ./deploy/setup.sh
```
NOTE: Please check that all the pods of the GPU operator are Completed or Running before moving to the next step.
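+ The pod status can be checked along these lines (assuming the operator was installed into the `gpu-operator` namespace):
+
+ ```sh
+ # Every pod listed should be in the Completed or Running state
+ kubectl get pods -n gpu-operator
+ ```
+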
@@ -99,16 +136,7 @@ Thu Apr 25 10:08:24 2024
(base) openstack@netsres62:~/asmalvan/instaslice2$
```

- - Get the installed GPU operator name using command
-
- ```sh
- helm list --all-namespaces
- (base) openstack@netsres62:~/asmalvan/instaslice2$ helm list --all-namespaces
- NAME                     NAMESPACE     REVISION  UPDATED                                  STATUS    CHART                 APP VERSION
- gpu-operator-1714053627  gpu-operator  1         2024-04-25 10:00:30.933618302 -0400 EDT  deployed  gpu-operator-v23.9.2  v23.9.2
- ```
-
- ```

- Delete MIG slices using the command

```sh
@@ -147,43 +175,29 @@ Successfully created compute instance ID 0 on GPU 1 GPU instance ID 2 using p

```sh
(base) openstack@netsres62:~/asmalvan/instaslice2$ kubectl patch clusterpolicies.nvidia.com/cluster-policy -n gpu-operator --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "test"}}}}'
```
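+ Whether the patch landed can be verified by reading the field back (a sketch; the policy name and namespace follow the patch command above):
+
+ ```sh
+ # Should print the device plugin config name set by the patch, e.g. "test"
+ kubectl get clusterpolicies.nvidia.com/cluster-policy -n gpu-operator \
+   -o jsonpath='{.spec.devicePlugin.config.name}'
+ ```
+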
- - Label (ideally worker nodes) in the cluster to watch the configmap
-
- ```sh
- kubectl label node --all nvidia.com/device-plugin.config=a100-40gb
- ```

You are now all set to dynamically create slices on the cluster using InstaSlice.

- ### Running the controller locally
-
- - Install InstaSlice CRD on the cluster
-
- ```sh
- make install
- ```
-
- - Make sure you have sudo access, run the following command
+ ### Running the controller
-
- ```sh
- make run
- ```
+ - Refer to the section `To Deploy on the cluster`

### Submitting the workload

- Submit a sample workload using the command

```sh
kubectl apply -f ./samples/test-pod.yaml
- pod/cuda-vectoradd-2 created
+ pod/cuda-vectoradd-5 created
```

- Check the status of the workload using the commands

```sh
kubectl get pods
NAME               READY   STATUS    RESTARTS   AGE
- cuda-vectoradd-2   1/1     Running   0          15s
- kubectl logs cuda-vectoradd-2
+ cuda-vectoradd-5   1/1     Running   0          15s
+ kubectl logs cuda-vectoradd-5
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-31cfe05c-ed13-cd17-d7aa-c63db5108c24)
MIG 1g.5gb Device 0: (UUID: MIG-c5720b34-e550-5278-90e6-d99a979aafd1)
[Vector addition of 50000 elements]
@@ -193,9 +207,101 @@ Copy output data from the CUDA device to the host memory
Test PASSED
Done
+
+ +-----------------------------------------------------------------------------------------+
+ | NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
+ |-----------------------------------------+------------------------+----------------------+
+ | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+ | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
+ |                                         |                        |               MIG M. |
+ |=========================================+========================+======================|
+ |   0  NVIDIA A100-PCIE-40GB          Off |   00000000:0E:00.0 Off |                   On |
+ | N/A   52C    P0             75W /  250W |      50MiB /  40960MiB |      N/A     Default |
+ |                                         |                        |              Enabled |
+ +-----------------------------------------+------------------------+----------------------+
+ |   1  NVIDIA A100-PCIE-40GB          Off |   00000000:0F:00.0 Off |                   On |
+ | N/A   60C    P0             75W /  250W |      37MiB /  40960MiB |      N/A     Default |
+ |                                         |                        |              Enabled |
+ +-----------------------------------------+------------------------+----------------------+
+
+ +-----------------------------------------------------------------------------------------+
+ | MIG devices:                                                                            |
+ +------------------+----------------------------------+-----------+-----------------------+
+ | GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
+ |      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
+ |                  |                                  |        ECC|                       |
+ |==================+==================================+===========+=======================|
+ |  0    2   0   0  |             37MiB / 19968MiB     | 42      0 |  3   0    2    0    0 |
+ |                  |              0MiB / 32767MiB     |           |                       |
+ +------------------+----------------------------------+-----------+-----------------------+
+ |  0   10   0   1  |             12MiB /  4864MiB     | 14      0 |  1   0    0    0    0 |
+ |                  |              0MiB /  8191MiB     |           |                       |
+ +------------------+----------------------------------+-----------+-----------------------+
+ |  1    2   0   0  |             37MiB / 19968MiB     | 42      0 |  3   0    2    0    0 |
+ |                  |              0MiB / 32767MiB     |           |                       |
+ +------------------+----------------------------------+-----------+-----------------------+
+
+ +-----------------------------------------------------------------------------------------+
+ | Processes:                                                                              |
+ |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
+ |        ID   ID                                                               Usage      |
+ |=========================================================================================|
+ |  No running processes found                                                             |
+ +-----------------------------------------------------------------------------------------+
+
+ ```
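+ The newly created slice can also be listed directly on the host (a quick check; the UUIDs will mirror those shown in the pod logs above):
+
+ ```sh
+ # List GPUs and their MIG devices by UUID
+ nvidia-smi -L
+ ```
+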
+ ### Deleting the workload
+
+ - Delete the pod and see the newly created MIG slice deleted
+
+ ```sh
+ kubectl delete pod cuda-vectoradd-5
+
+ +-----------------------------------------------------------------------------------------+
+ | NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
+ |-----------------------------------------+------------------------+----------------------+
+ | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+ | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
+ |                                         |                        |               MIG M. |
+ |=========================================+========================+======================|
+ |   0  NVIDIA A100-PCIE-40GB          Off |   00000000:0E:00.0 Off |                   On |
+ | N/A   53C    P0             75W /  250W |      37MiB /  40960MiB |      N/A     Default |
+ |                                         |                        |              Enabled |
+ +-----------------------------------------+------------------------+----------------------+
+ |   1  NVIDIA A100-PCIE-40GB          Off |   00000000:0F:00.0 Off |                   On |
+ | N/A   60C    P0             75W /  250W |      37MiB /  40960MiB |      N/A     Default |
+ |                                         |                        |              Enabled |
+ +-----------------------------------------+------------------------+----------------------+
+
+ +-----------------------------------------------------------------------------------------+
+ | MIG devices:                                                                            |
+ +------------------+----------------------------------+-----------+-----------------------+
+ | GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
+ |      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
+ |                  |                                  |        ECC|                       |
+ |==================+==================================+===========+=======================|
+ |  0    2   0   0  |             37MiB / 19968MiB     | 42      0 |  3   0    2    0    0 |
+ |                  |              0MiB / 32767MiB     |           |                       |
+ +------------------+----------------------------------+-----------+-----------------------+
+ |  1    2   0   0  |             37MiB / 19968MiB     | 42      0 |  3   0    2    0    0 |
+ |                  |              0MiB / 32767MiB     |           |                       |
+ +------------------+----------------------------------+-----------+-----------------------+
+
+ +-----------------------------------------------------------------------------------------+
+ | Processes:                                                                              |
+ |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
+ |        ID   ID                                                               Usage      |
+ |=========================================================================================|
+ |  No running processes found                                                             |
+ +-----------------------------------------------------------------------------------------+
+
```
+
### To Deploy on the cluster

+ **All in one command**
+
+ ```sh
+ make docker-build && make docker-push && make deploy
+ ```
+
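+ Assuming the Makefile follows the common Kubebuilder layout, the image reference can be overridden in the same invocation via `IMG` (the registry path below is a placeholder):
+
+ ```sh
+ # Build, push, and deploy with a custom image reference
+ make docker-build docker-push deploy IMG=<registry>/instaslice:tag
+ ```
+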
**Build and push your image to the location specified by `IMG`:**

```sh