Commit 5388659

fix doc
1 parent 9a8de6a commit 5388659

1 file changed

+135
-29
lines changed

README.md

Lines changed: 135 additions & 29 deletions
@@ -14,9 +14,46 @@ Experimental InstaSlice works with GPU operator to create mig slices on demand.
1414

1515
### Install KinD cluster with GPU operator
1616

17+
- Make sure the GPUs on the host have MIG enabled
18+
19+
```sh
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:0E:00.0 Off |                   On |
| N/A   36C    P0             33W /  250W |       0MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off |   00000000:0F:00.0 Off |                   On |
| N/A   40C    P0             32W /  250W |       0MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                   |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
52+
```
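The MIG mode shown in the output above can also be checked from a script. The following is a minimal sketch, assuming your `nvidia-smi` build supports the `mig.mode.current` query field. The helper name `mig_disabled_gpus` is hypothetical and not part of InstaSlice.

```sh
#!/bin/sh
# Hypothetical helper (not part of InstaSlice): reads CSV rows of the form
# "index, name, mig.mode.current" on stdin and prints the index of every
# GPU whose MIG mode is anything other than "Enabled".
mig_disabled_gpus() {
    awk -F', *' '$3 != "Enabled" { print $1 }'
}

# Query the driver only when nvidia-smi is installed on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
    disabled=$(nvidia-smi --query-gpu=index,name,mig.mode.current \
        --format=csv,noheader | mig_disabled_gpus)
    if [ -n "$disabled" ]; then
        echo "MIG is not enabled on GPU(s):" $disabled
    else
        echo "MIG is enabled on all GPUs"
    fi
fi
```

If a GPU is reported, MIG mode can usually be switched on with `sudo nvidia-smi -i <index> -mig 1`. Depending on the system, a GPU reset or reboot may be needed before the change takes effect.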
53+
1754
- Run the setup script below
1855
```sh
19-
./setup.sh
56+
sh ./deploy/setup.sh
2057
```
2158
NOTE: Check that all GPU operator pods are Completed or Running before moving to the next step.
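The pod check in the note above can be scripted. This is a sketch only: it assumes the setup script installs the GPU operator into the `gpu-operator` namespace (the namespace used elsewhere in this README), and the helper name `pods_not_ready` is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS ...) and prints every pod whose STATUS is
# neither Running nor Completed, together with that status.
pods_not_ready() {
    awk 'NF >= 3 && $3 != "Running" && $3 != "Completed" { print $1 " (" $3 ")" }'
}

# Only talk to the cluster when kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
    # Assumption: the GPU operator runs in the gpu-operator namespace.
    if pods=$(kubectl get pods -n gpu-operator --no-headers 2>/dev/null); then
        pending=$(printf '%s\n' "$pods" | pods_not_ready)
        if [ -n "$pending" ]; then
            echo "Still waiting on:"
            echo "$pending"
        else
            echo "All GPU operator pods are Running or Completed"
        fi
    else
        echo "Could not list pods in the gpu-operator namespace"
    fi
fi
```

Re-run the check until it reports no pending pods, then continue with the next step.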
2259

@@ -99,16 +136,7 @@ Thu Apr 25 10:08:24 2024
99136
(base) openstack@netsres62:~/asmalvan/instaslice2$
100137
```
101138

102-
- Get the installed GPU operator name using command
103139

104-
```sh
105-
helm list --all-namespaces
106-
(base) openstack@netsres62:~/asmalvan/instaslice2$ helm list --all-namespaces
107-
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
108-
gpu-operator-1714053627 gpu-operator 1 2024-04-25 10:00:30.933618302 -0400 EDT deployed gpu-operator-v23.9.2 v23.9.2
109-
```
110-
111-
```
112140
- Delete MIG slices using the command
113141

114142
```sh
@@ -147,43 +175,29 @@ Successfully created compute instance ID 0 on GPU 1 GPU instance ID 2 using p
147175
```sh
148176
(base) openstack@netsres62:~/asmalvan/instaslice2$ kubectl patch clusterpolicies.nvidia.com/cluster-policy -n gpu-operator --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "test"}}}}'
149177
```
150-
- Label (ideally worker nodes) in the cluster to watch the configmap
151178

152-
```sh
153-
kubectl label node --all nvidia.com/device-plugin.config=a100-40gb
154-
```
155179
You are now all set to dynamically create slices on the cluster using InstaSlice.
156180

157-
### Running the controller locally
158-
159-
- Install InstaSlice CRD on the cluster
160-
161-
```sh
162-
make install
163-
```
164-
165-
- Make sure you have sudo access, run the following command
181+
### Running the controller
166182

167-
```sh
168-
make run
169-
```
183+
- Refer to section `To Deploy on the cluster`
170184

171185
### Submitting the workload
172186

173187
- Submit a sample workload using the command
174188

175189
```sh
176190
kubectl apply -f ./samples/test-pod.yaml
177-
pod/cuda-vectoradd-2 created
191+
pod/cuda-vectoradd-5 created
178192
```
179193

180194
- Check the status of the workload using the commands
181195

182196
```sh
183197
kubectl get pods
184198
NAME READY STATUS RESTARTS AGE
185-
cuda-vectoradd-2 1/1 Running 0 15s
186-
kubectl logs cuda-vectoradd-2
199+
cuda-vectoradd-5 1/1 Running 0 15s
200+
kubectl logs cuda-vectoradd-5
187201
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-31cfe05c-ed13-cd17-d7aa-c63db5108c24)
188202
MIG 1g.5gb Device 0: (UUID: MIG-c5720b34-e550-5278-90e6-d99a979aafd1)
189203
[Vector addition of 50000 elements]
@@ -193,9 +207,101 @@ Copy output data from the CUDA device to the host memory
193207
Test PASSED
194208
Done
195209

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:0E:00.0 Off |                   On |
| N/A   52C    P0             75W /  250W |      50MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off |   00000000:0F:00.0 Off |                   On |
| N/A   60C    P0             75W /  250W |      37MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    2   0   0  |                37MiB / 19968MiB  | 42      0 |  3   0    2    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   10   0   1  |                12MiB /  4864MiB  | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB /  8191MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    2   0   0  |                37MiB / 19968MiB  | 42      0 |  3   0    2    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
250+
251+
```
252+
### Deleting the workload
253+
254+
- Delete the pod and confirm that the newly created MIG slice is deleted
255+
256+
```sh
257+
kubectl delete pod cuda-vectoradd-5
258+
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:0E:00.0 Off |                   On |
| N/A   53C    P0             75W /  250W |      37MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off |   00000000:0F:00.0 Off |                   On |
| N/A   60C    P0             75W /  250W |      37MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    2   0   0  |                37MiB / 19968MiB  | 42      0 |  3   0    2    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    2   0   0  |                37MiB / 19968MiB  | 42      0 |  3   0    2    0    0 |
|                  |                 0MiB / 32767MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
296+
196297
```
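To confirm the slice was removed without reading the full table, you can count MIG devices before and after deleting the pod. The sketch below assumes `nvidia-smi -L` lists each MIG device on its own line beginning with `MIG`, as in the workload log above. The helper name `count_mig_devices` is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: counts the MIG device lines in `nvidia-smi -L`
# output read from stdin. grep -c exits non-zero when the count is 0,
# so `|| true` keeps the function from reporting a failure in that case.
count_mig_devices() {
    grep -c '^[[:space:]]*MIG ' || true
}

# Run this before and after `kubectl delete pod` and compare the numbers.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "MIG devices currently present: $(nvidia-smi -L | count_mig_devices)"
fi
```

In the example output above, the count would drop from 3 to 2 once the pod's `1g.5gb` slice is deleted.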
298+
197299
### To Deploy on the cluster
198300

301+
**All in one command**
302+
303+
make docker-build && make docker-push && make deploy
304+
199305
**Build and push your image to the location specified by `IMG`:**
200306

201307
```sh
