Commit e56e0e6

Merge pull request #1384 from tkatila/doc-updates

Doc updates

2 parents 5ed1638 + 8971280

File tree: 2 files changed (+83 / -52 lines)

cmd/gpu_plugin/README.md

Lines changed: 17 additions & 2 deletions
@@ -4,6 +4,7 @@ Table of Contents
 
 * [Introduction](#introduction)
 * [Modes and Configuration Options](#modes-and-configuration-options)
+* [Operation modes for different workload types](#operation-modes-for-different-workload-types)
 * [Installation](#installation)
 * [Prerequisites](#prerequisites)
 * [Drivers for discrete GPUs](#drivers-for-discrete-gpus)
@@ -50,11 +51,23 @@ backend libraries can offload compute operations to GPU.
 | -enable-monitoring | - | disabled | Enable 'i915_monitoring' resource that provides access to all Intel GPU devices on the node |
 | -resource-manager | - | disabled | Enable fractional resource management, [see also dependencies](#fractional-resources) |
 | -shared-dev-num | int | 1 | Number of containers that can share the same GPU device |
-| -allocation-policy | string | none | 3 possible values: balanced, packed, none. It is meaningful when shared-dev-num > 1, balanced mode is suitable for workload balance among GPU devices, packed mode is suitable for making full use of each GPU device, none mode is the default. Allocation policy does not have effect when resource manager is enabled. |
+| -allocation-policy | string | none | 3 possible values: balanced, packed, none. For shared-dev-num > 1: _balanced_ mode spreads workloads among GPU devices, _packed_ mode fills one GPU fully before moving to the next, and _none_ selects the first available device from kubelet. Default is _none_. Allocation policy does not have an effect when the resource manager is enabled. |
 
 The plugin also accepts a number of other arguments (common to all plugins) related to logging.
 Please use the -h option to see the complete list of logging related options.
 
+## Operation modes for different workload types
+
+The Intel GPU plugin supports a few different operation modes. Depending on the workloads the cluster is running, some modes make more sense than others. Below is a table that explains the differences between the modes and suggests workload types for each mode. Mode selection applies to the whole GPU plugin deployment, so it is a cluster-wide decision.
+
+| Mode | Sharing | Intended workloads | Suitable for time-critical workloads |
+|:---- |:-------- |:------- |:------- |
+| shared-dev-num == 1 | No, 1 container per GPU | Workloads using all GPU capacity, e.g. AI training | Yes |
+| shared-dev-num > 1 | Yes, >1 containers per GPU | (Batch) workloads using only part of GPU resources, e.g. inference, media transcode/analytics, or CPU-bound GPU workloads | No |
+| shared-dev-num > 1 && resource-management | Yes and no, >=1 containers per GPU | Any. For best results, all workloads should declare their expected GPU resource usage (memory, millicores). Requires [GAS](https://github.com/intel/platform-aware-scheduling/tree/master/gpu-aware-scheduling). See also [fractional use](#fractional-resources-details) | Yes. 1000 millicores = exclusive GPU usage. See note below. |
+
+> **Note**: Exclusive GPU usage with >=1000 millicores requires that *all other GPU containers* also specify (non-zero) millicores resource usage.
+
 ## Installation
 
 The following sections detail how to obtain, build, deploy and test the GPU device plugin.
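To make the resource-managed mode concrete, here is a minimal sketch of a pod that declares its expected GPU usage. It assumes the fractional resource names `gpu.intel.com/i915`, `gpu.intel.com/millicores` and `gpu.intel.com/memory.max` described in the fractional-resources section; the image name and resource values are placeholders, not recommendations.

```yaml
# Illustrative pod for the resource-managed mode (assumes GAS and
# -resource-manager are enabled; names and values are examples only).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-sample
spec:
  containers:
  - name: inference
    image: my-inference-image:latest   # placeholder image
    resources:
      limits:
        gpu.intel.com/i915: 1          # one (possibly shared) GPU device
        gpu.intel.com/millicores: 500  # about half of one GPU's compute
        gpu.intel.com/memory.max: 4G   # expected GPU memory use
```

With declarations like these on all GPU containers, GAS can pack or isolate workloads; per the note above, a 1000-millicore request only guarantees exclusivity when every other GPU container also declares non-zero millicores.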
@@ -152,7 +165,9 @@ Release tagged images of the components are also available on the Docker hub, ta
 release version numbers in the format `x.y.z`, corresponding to the branches and releases in this
 repository. Thus the easiest way to deploy the plugin in your cluster is to run this command
 
-Note: Replace `<RELEASE_VERSION>` with the desired [release tag](https://github.com/intel/intel-device-plugins-for-kubernetes/tags) or `main` to get `devel` images.
+> **Note**: Replace `<RELEASE_VERSION>` with the desired [release tag](https://github.com/intel/intel-device-plugins-for-kubernetes/tags) or `main` to get `devel` images.
+
+> **Note**: Add ```--dry-run=client -o yaml``` to the ```kubectl``` commands below to visualize the yaml content being applied.
 
 See [the development guide](../../DEVEL.md) for details if you want to deploy a customized version of the plugin.
 
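As a sketch of that dry-run note, a client-side dry run renders the manifests without applying them; the kustomization path below is assumed from the repository's deployment layout:

```bash
# Preview the GPU plugin manifests instead of applying them (path assumed
# from the repository layout; replace <RELEASE_VERSION> with a real tag).
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin?ref=<RELEASE_VERSION>' --dry-run=client -o yaml
```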
cmd/operator/README.md

Lines changed: 66 additions & 50 deletions
@@ -5,6 +5,7 @@ Table of Contents
 * [Introduction](#introduction)
 * [Installation](#installation)
 * [Upgrade](#upgrade)
+* [Limiting Supported Devices](#limiting-supported-devices)
 * [Known issues](#known-issues)
 
 ## Introduction
@@ -16,6 +17,12 @@ administrators.
 
 ## Installation
 
+The default operator deployment depends on NFD and cert-manager. Those components have to be installed to the cluster before the operator can be deployed.
+
+> **Note**: The operator can also be installed via Helm charts. See [INSTALL.md](../../INSTALL.md) for details.
+
+### NFD
+
 Install NFD (if it's not already installed) and node labelling rules (requires NFD v0.10+):
 
 ```
@@ -38,7 +45,7 @@ nfd-worker-qqq4h 1/1 Running 0 25h
 Note that labelling is not performed immediately. Give NFD a minute to pick up the rules and label nodes.
 
 As a result all found devices should have corresponding labels, e.g. for Intel DLB devices the label is
-intel.feature.node.kubernetes.io/dlb:
+`intel.feature.node.kubernetes.io/dlb`:
 ```
 $ kubectl get no -o json | jq .items[].metadata.labels | grep intel.feature.node.kubernetes.io/dlb
 "intel.feature.node.kubernetes.io/dlb": "true",
@@ -55,6 +62,8 @@ deployments/operator/samples/deviceplugin_v1_fpgadeviceplugin.yaml: intel.fea
 deployments/operator/samples/deviceplugin_v1_dsadeviceplugin.yaml: intel.feature.node.kubernetes.io/dsa: 'true'
 ```
 
+### Cert-Manager
+
 The default operator deployment depends on [cert-manager](https://cert-manager.io/) running in the cluster.
 See installation instructions [here](https://cert-manager.io/docs/installation/kubectl/).
 
@@ -68,45 +77,7 @@ cert-manager-cainjector-87c85c6ff-59sb5 1/1 Running 0 21d
 cert-manager-webhook-64dc9fff44-29cfc 1/1 Running 0 21d
 ```
 
-Also if your cluster operates behind a corporate proxy make sure that the API
-server is configured not to send requests to cluster services through the
-proxy. You can check that with the following command:
-
-```bash
-$ kubectl describe pod kube-apiserver --namespace kube-system | grep -i no_proxy | grep "\.svc"
-```
-
-In case there's no output and your cluster was deployed with `kubeadm` open
-`/etc/kubernetes/manifests/kube-apiserver.yaml` at the control plane nodes and
-append `.svc` and `.svc.cluster.local` to the `no_proxy` environment variable:
-
-```yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  ...
-spec:
-  containers:
-  - command:
-    - kube-apiserver
-    - --advertise-address=10.237.71.99
-    ...
-    env:
-    - name: http_proxy
-      value: http://proxy.host:8080
-    - name: https_proxy
-      value: http://proxy.host:8433
-    - name: no_proxy
-      value: 127.0.0.1,localhost,.example.com,10.0.0.0/8,.svc,.svc.cluster.local
-    ...
-```
-
-**Note:** To build clusters using `kubeadm` with the right `no_proxy` settings from the very beginning,
-set the cluster service names to `$no_proxy` before `kubeadm init`:
-
-```
-$ export no_proxy=$no_proxy,.svc,.svc.cluster.local
-```
+### Device Plugin Operator
 
 Finally deploy the operator itself:
 
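A hedged sketch of that deployment command, assuming the operator's `deployments/operator/default` kustomization path in this repository:

```bash
# Assumed operator deployment command; the kustomization path follows the
# repository layout. Replace <RELEASE_VERSION> with a release tag.
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/operator/default?ref=<RELEASE_VERSION>'
```

If the deployment succeeds, the operator controller pod comes up in its own namespace (`inteldeviceplugins-system` in recent releases; treat the exact name as an assumption).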
@@ -117,7 +88,7 @@ $ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes
 Now you can deploy the device plugins by creating corresponding custom resources.
 The samples for them are available [here](/deployments/operator/samples/).
 
-## Usage
+### Device Plugin Custom Resource
 
 Deploy your device plugin by applying its custom resource, e.g.
 `GpuDevicePlugin` with
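A minimal sketch of such a custom resource, patterned after the linked samples (the image tag, `sharedDevNum`, and `nodeSelector` values are illustrative, not recommendations):

```yaml
# Illustrative GpuDevicePlugin custom resource based on the samples
# directory; field values are examples only.
apiVersion: deviceplugin.intel.com/v1
kind: GpuDevicePlugin
metadata:
  name: gpudeviceplugin-sample
spec:
  image: intel/intel-gpu-plugin:0.26.0   # plugin image with a numeric tag
  sharedDevNum: 1                        # containers per GPU (see GPU plugin README)
  logLevel: 4
  nodeSelector:
    intel.feature.node.kubernetes.io/gpu: "true"
```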
@@ -134,8 +105,22 @@ NAME DESIRED READY NODE SELECTOR AGE
 gpudeviceplugin-sample 1 1 5s
 ```
 
+## Upgrade
+
+The upgrade of the deployed plugins can be done by simply installing a new release of the operator.
+
+The operator auto-upgrades operator-managed plugins (CR images and thus the corresponding deployed daemonsets) to the current release of the operator.
+
+During upgrade the tag in the image path is updated (e.g. docker.io/intel/intel-sgx-plugin:tag), but the rest of the path is left intact.
+
+No upgrade is done for:
+- Non-operator managed deployments
+- Operator deployments without numeric tags
+
+## Limiting Supported Devices
+
 In order to limit the deployment to a specific device type,
-use one of kustomizations under deployments/operator/device.
+use one of the kustomizations under `deployments/operator/device`.
 
 For example, to limit the deployment to FPGA, use:
 
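A hedged sketch of that command, assuming an `fpga` kustomization under the `deployments/operator/device` directory named above:

```bash
# Assumed device-limiting kustomization path; replace <RELEASE_VERSION>
# with a release tag.
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/operator/device/fpga?ref=<RELEASE_VERSION>'
```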

@@ -148,20 +133,51 @@ In this case, create a new kustomization with the necessary resources
148133
that passes the desired device types to the operator using `--device`
149134
command line argument multiple times.
150135

151-
## Upgrade
136+
## Known issues
152137

153-
The upgrade of the deployed plugins can be done by simply installing a new release of the operator.
138+
### Cluster behind a proxy
154139

155-
The operator auto-upgrades operator-managed plugins (CR images and thus corresponding deployed daemonsets) to the current release of the operator.
140+
If your cluster operates behind a corporate proxy make sure that the API
141+
server is configured not to send requests to cluster services through the
142+
proxy. You can check that with the following command:
156143

157-
The [registry-url]/[namespace]/[image] are kept intact on the upgrade.
144+
```bash
145+
$ kubectl describe pod kube-apiserver --namespace kube-system | grep -i no_proxy | grep "\.svc"
146+
```
158147

159-
No upgrade is done for:
148+
In case there's no output and your cluster was deployed with `kubeadm` open
149+
`/etc/kubernetes/manifests/kube-apiserver.yaml` at the control plane nodes and
150+
append `.svc` and `.svc.cluster.local` to the `no_proxy` environment variable:
160151

161-
- Non-operator managed deployments
162-
- Operator deployments without numeric tags
152+
```yaml
153+
apiVersion: v1
154+
kind: Pod
155+
metadata:
156+
...
157+
spec:
158+
containers:
159+
- command:
160+
- kube-apiserver
161+
- --advertise-address=10.237.71.99
162+
...
163+
env:
164+
- name: http_proxy
165+
value: http://proxy.host:8080
166+
- name: https_proxy
167+
value: http://proxy.host:8433
168+
- name: no_proxy
169+
value: 127.0.0.1,localhost,.example.com,10.0.0.0/8,.svc,.svc.cluster.local
170+
...
171+
```
163172

164-
## Known issues
173+
**Note:** To build clusters using `kubeadm` with the right `no_proxy` settings from the very beginning,
174+
set the cluster service names to `$no_proxy` before `kubeadm init`:
175+
176+
```
177+
$ export no_proxy=$no_proxy,.svc,.svc.cluster.local
178+
```
179+
180+
### Leader election enabled
165181

166182
When the operator is run with leader election enabled, that is with the option
167183
`--leader-elect`, make sure the cluster is not overloaded with excessive
