|
1 | 1 | # Intel GPU NFD hook
|
2 | 2 |
|
3 |
| -This is the Node Feature Discovery binary hook implementation for the Intel |
4 |
| -GPUs. The intel-gpu-initcontainer which is built among other images can be |
5 |
| -placed as part of the gpu-plugin deployment, so that it copies this hook to the |
6 |
| -host system only in those hosts, in which also gpu-plugin is deployed. |
| 3 | +This is the [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) |
| 4 | +binary hook implementation for the Intel GPUs. The intel-gpu-initcontainer which |
| 5 | +is built among other images can be placed as part of the gpu-plugin deployment, |
| 6 | +so that it copies this hook to the host system only on hosts where gpu-plugin |
| 7 | +is also deployed. |
7 | 8 |
|
8 | 9 | When NFD worker runs this hook, it will add a number of labels to the nodes,
|
9 | 10 | which can be used for example to deploy services to nodes with specific GPU
|
10 | 11 | types. Selected numeric labels can be turned into kubernetes extended resources
|
11 | 12 | by the NFD, allowing for finer grained resource management for GPU-using PODs.
|
12 | 13 |
|
13 |
| -In the NFD deployment, the hook requires /host-sys -folder to have the host /sys |
14 |
| --folder content mounted, and /host-dev to have the host /dev -folder content |
15 |
| -mounted. Write access is not necessary. |
| 14 | +In the NFD deployment, the hook requires the content of the host `/sys` folder to be mounted at `/host-sys`. Write access is not necessary. |
| 15 | + |
| 16 | +## GPU memory |
16 | 17 |
|
17 | 18 | GPU memory amount is read from sysfs gt/gt* files and turned into a label.
|
18 |
| -There are two supported environment variables named GPU_MEMORY_OVERRIDE and |
19 |
| -GPU_MEMORY_RESERVED. Both are supposed to hold numeric values. For systems with |
| 19 | +There are two supported environment variables named `GPU_MEMORY_OVERRIDE` and |
| 20 | +`GPU_MEMORY_RESERVED`. Both are supposed to hold numeric byte amounts. For systems with |
20 | 21 | older kernel drivers or GPUs which do not support reading the GPU memory
|
21 |
| -amount, the GPU_MEMORY_OVERRIDE environment variable value is turned into a GPU |
22 |
| -memory amount label instead of a read value. GPU_MEMORY_RESERVED value will be |
| 22 | +amount, the `GPU_MEMORY_OVERRIDE` environment variable value is turned into a GPU |
| 23 | +memory amount label instead of a read value. `GPU_MEMORY_RESERVED` value will be |
23 | 24 | scoped out from the GPU memory amount found from sysfs.
|
| 25 | + |
| 26 | +## Default labels |
| 27 | + |
| 28 | +The following labels are created by default. You may turn numeric labels into extended resources with NFD. |
| 29 | + |
| 30 | +name | type | description| |
| 31 | +-----|------|------| |
| 32 | +|`gpu.intel.com/millicores`| number | node GPU count * 1000. Can be used as a finer grained shared execution fraction. |
| 33 | +|`gpu.intel.com/memory.max`| number | sum of detected [GPU memory amounts](#GPU-memory) in bytes OR environment variable value * GPU count |
| 34 | +|`gpu.intel.com/cards`| string | list of card names separated by '`.`'. The names match host `card*`-folders under `/sys/class/drm/`. |
| 35 | + |
| 36 | +## Capability labels (optional) |
| 37 | + |
| 38 | +Capability labels are created from information found inside debugfs, and therefore |
| 39 | +require running the NFD worker as root. Because debugfs is not guaranteed to be |
| 40 | +stable, these labels are not guaranteed to be stable either. If you don't need them, |
| 41 | +do not run the NFD worker as root, which is also more secure. |
| 42 | +Depending on your kernel driver, running the NFD hook as root may introduce the following labels: |
| 43 | + |
| 44 | +name | type | description| |
| 45 | +-----|------|------| |
| 46 | +|`gpu.intel.com/platform_gen`| string | GPU platform generation name, typically a number. |
| 47 | +|`gpu.intel.com/platform_<PLATFORM_NAME>_.count`| number | GPU count for the named platform. |
| 48 | +|`gpu.intel.com/platform_<PLATFORM_NAME>_.tiles`| number | GPU tile count in the GPUs of the named platform. |
| 49 | +|`gpu.intel.com/platform_<PLATFORM_NAME>_.present`| string | "true" for indicating the presence of the GPU platform. |
| 50 | + |
| 51 | +For the above to work as intended, installed GPUs must be identical in their capabilities. |
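
The default label rules described in the README text above can be illustrated with a small sketch. This is not the hook's actual implementation (the hook itself is a separate binary); the function names `gpu_memory` and `default_labels` and the input shape are hypothetical. It assumes that `GPU_MEMORY_OVERRIDE` is used only for GPUs whose memory amount cannot be read from sysfs, and that `GPU_MEMORY_RESERVED` is subtracted per GPU from the amount that was read.

```python
def gpu_memory(detected, override=0, reserved=0):
    """Return the memory amount in bytes for one GPU.

    detected is the byte amount read from sysfs, or None when the
    kernel driver or GPU does not expose it (hypothetical input shape).
    """
    if detected is None:
        # Assumption: the override replaces a value that could not be read.
        return override
    # Assumption: the reserved amount is subtracted per GPU.
    return detected - reserved


def default_labels(cards, override=0, reserved=0):
    """Build the three default labels.

    cards maps a card name (matching /sys/class/drm/card*) to its
    detected memory in bytes, or None if it could not be read.
    """
    names = sorted(cards)
    total = sum(gpu_memory(cards[n], override, reserved) for n in names)
    return {
        # node GPU count * 1000
        "gpu.intel.com/millicores": str(len(names) * 1000),
        # sum of per-GPU memory amounts in bytes
        "gpu.intel.com/memory.max": str(total),
        # card names separated by '.'
        "gpu.intel.com/cards": ".".join(names),
    }


if __name__ == "__main__":
    # Two GPUs without readable memory, 4 GB override each.
    print(default_labels({"card0": None, "card1": None}, override=4_000_000_000))
```

With two unreadable GPUs and a 4 GB override, the sketch yields `memory.max` equal to the override times the GPU count, which matches the "environment variable value * GPU count" wording in the label table.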