Commit 008c4f7

expand on numa
1 parent 105a9c4 commit 008c4f7

File tree

5 files changed: +222, -48 lines changed

network/README.md

Lines changed: 0 additions & 48 deletions
@@ -615,54 +615,6 @@ As I have shown in these sections it should be possible to be able to do a back-

## NUMA Affinity

[Non-uniform memory access (NUMA)](https://en.wikipedia.org/wiki/Non-uniform_memory_access) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor.
As modern servers have more than one CPU, to get the best performance, GPUs residing in the same block as the corresponding CPU should have their processes bound to that NUMA node.

Here is a typical 8x A100 GPU server, as visualized by [hwloc](https://github.com/open-mpi/hwloc):

![a100 server numa nodes](images/a100-server-hwloc.png)

As you can see it has 2 CPUs, each defining a NUMA block, and each such block contains a group of 4 GPUs. The GPUs are the grey blocks that say `CoProc`, with 108 compute units (SMs) and 79GB of memory.

footnote: the diagram was generated by `lstopo a100.png`

If you're using Hyper-Threads then you want to use `lstopo -l` to see the HT core count correctly. For example, if you have 2 NUMA nodes with 8 accelerators, 104 physical cpu-cores and 208 logical cores (thus `208/8=26` HT-cores per GPU), then the HT cores for GPU0 will be `[0, 1, 2, 3, 4, ..., 25, 104, 105, 106, 107, 108, ..., 129]` - first the physical cpu-cores and then the remaining HT cores, hence the gap.

#### Software Tools

note-to-self: probably belongs in its own chapter?

##### hwloc

https://github.com/open-mpi/hwloc

The Hardware Locality (hwloc) software project aims at easing the process of discovering hardware resources in parallel architectures. It offers command-line tools and a C API for consulting these resources, their locality, attributes, and interconnection. hwloc primarily aims at helping high-performance computing (HPC) applications, but is also applicable to any project seeking to exploit code and/or data locality on modern computing platforms.

Diagnostics: to take a snapshot of the server NUMA topology and save it as an image (it supports many other formats):
```
lstopo a100.png
```

NUMA node binding: `hwloc-bind` - binding processes, threads and memory

Bind an existing process to a specific NUMA node:
```
hwloc-bind --pid 1234 numa:0
```

Similar software: `numactl`/`libnuma`

Some useful suggestions can be found in the [pytorch docs](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#utilize-non-uniform-memory-access-numa-controls)

## Important nuances

### Real network throughput

training/performance/README.md

Lines changed: 140 additions & 0 deletions
@@ -479,3 +479,143 @@ The full recommendations are:
3. `b*s`, `h/a`, and `h/t` should be divisible by a power of 2
4. `(b*a)/t` should be an integer
5. `t` should be as small as possible
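
For a quick sanity check of a planned config against these rules, here is a minimal sketch - it assumes `b`, `s`, `h`, `a` and `t` denote the micro-batch size, sequence length, hidden size, number of attention heads and tensor-parallel degree as defined earlier in this document, and it picks 64 as the power of 2 purely for illustration:

```
def check_recommendations(b, s, h, a, t, pow2=64):
    # pow2=64 is an arbitrary choice of "a power of 2" for illustration
    return {
        "b*s divisible by a power of 2": (b * s) % pow2 == 0,
        "h/a divisible by a power of 2": (h // a) % pow2 == 0,
        "h/t divisible by a power of 2": (h // t) % pow2 == 0,
        "(b*a)/t is an integer":         (b * a) % t == 0,
    }

# e.g. a hypothetical config: micro-batch 4, seqlen 2048, hidden 8192, 64 heads, TP=8
print(check_recommendations(b=4, s=2048, h=8192, a=64, t=8))
```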

## NUMA affinity

[Non-uniform memory access (NUMA)](https://en.wikipedia.org/wiki/Non-uniform_memory_access) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor.
As modern servers have more than one CPU, to get the best performance the processes driving the accelerators should be bound to the same NUMA node as the CPU those accelerators are attached to.

First, let's understand what NUMA nodes signify.

Here is the NUMA node diagram of a typical 8x A100 GPU server:

![a100 server numa nodes](images/a100-server-hwloc.png)

As you can see it has 2 CPUs, each defining a NUMA block, and each such block contains a group of 4 GPUs. The GPUs are the grey blocks that say `CoProc`, with 108 compute units (SMs) and 79GB of memory.

footnote: the diagram was generated by `lstopo a100.png` from [hwloc](https://github.com/open-mpi/hwloc).

If you're using Hyper-Threads then you want to use `lstopo -l` to see the HT core count presented correctly. For example, if you have 2 NUMA nodes with 8 accelerators, 104 physical cpu-cores and 208 logical cores (thus `208/8=26` HT-cores per GPU), then the HT cores will be:

- gpu0..3: `[0, 1, 2, 3, ..., 51, 104, 105, 106, ..., 155]`
- gpu4..7: `[52, 53, 54, ..., 103, 156, 157, 158, ..., 207]`

The physical cpu-cores are listed first, followed by the remaining HT cores, hence the strange gap.
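
If it helps to see where those ranges come from, here is a minimal sketch that computes the cpu-core list of each NUMA node for this particular layout - the numbers (2 NUMA nodes, 104 physical cores, 208 logical cores) are assumptions matching the example above:

```
# compute the logical cpu-core ids belonging to each NUMA node, assuming physical cores are
# split evenly across nodes and HT siblings are numbered after all the physical cores
num_nodes, phys_cores = 2, 104
per_node = phys_cores // num_nodes  # 52 physical cores per node

for node in range(num_nodes):
    phys = list(range(node * per_node, (node + 1) * per_node))  # e.g. 0..51
    ht = list(range(phys_cores + node * per_node, phys_cores + (node + 1) * per_node))  # e.g. 104..155
    print(f"node{node}: {phys[0]}..{phys[-1]} + {ht[0]}..{ht[-1]} ({len(phys) + len(ht)} cores)")
```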

Now that it's clear that the various compute components are placed in 2 or more groups, to achieve the best performance we need to ensure that the components communicate within the group they belong to and avoid any cross-talk. For example, if gpu0 belongs to NUMA node 0, then the process that drives this GPU should only use cpu-cores from NUMA node 0.

The same should apply to networking or any other components that you may have control over.
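
To find out which NUMA node each GPU belongs to without reading it off the diagram, here is a minimal sketch that queries Linux sysfs, using `pynvml` (introduced [below](#pynvml)) only to get each GPU's PCI bus id - it assumes a Linux box with the NVIDIA driver loaded, and mirrors what the `numa-set.sh` helper shown later does in bash:

```
import pynvml as nvml

nvml.nvmlInit()
for i in range(nvml.nvmlDeviceGetCount()):
    handle = nvml.nvmlDeviceGetHandleByIndex(i)
    bus_id = nvml.nvmlDeviceGetPciInfo(handle).busId
    if isinstance(bus_id, bytes):  # older pynvml versions return bytes
        bus_id = bus_id.decode()
    # sysfs uses a 4-hex-digit PCI domain, nvml reports 8 digits - drop the extra 4
    with open(f"/sys/bus/pci/devices/{bus_id[4:].lower()}/numa_node") as f:
        print(f"gpu{i}: NUMA node {f.read().strip()}")
```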

Practically though, in my experience so far, if your workload is very light on CPU work this change will make very little difference to the overall performance, but it can be quite impactful if a lot of CPU work is done. On the other hand, since doing the most efficient thing is easy, even the tiniest improvement is likely to accumulate over long training jobs, so it's worth implementing, IMHO.

### NUMA process binding

There are multiple ways to accomplish the binding of processes to the cpu-cores of the right NUMA node.

#### numactl

One of the most common tools to do that is `numactl`, which sets the NUMA affinity as it launches a new process.

For example, let's see how it can be integrated with the `torchrun` launcher.

This launcher currently needs a helper util [numa-set.sh](numa/numa-set.sh) to perform the NUMA affinity setting. Once you have downloaded it and made it executable, you can get the right NUMA affinity using:

```
torchrun --nproc_per_node=8 --role : --tee 3 --no-python ./numa-set.sh python your-program.py
```

Note: you'd need `numactl` installed on your system for this util to work.

For example, here is how you can validate that the assignments are correct:
```
torchrun --nproc_per_node=8 --role : --tee 3 --no-python ./numa-set.sh python -c \
'import os; cores=os.sched_getaffinity(0); print(f"{len(cores)} visible cpu cores: {cores}")'
```

On a system with 208 HT cpu-cores, you will most likely see:

```
[:0]:104 visible cpu cores: {0, 1, 2, 3, 4, 5...
[:1]:104 visible cpu cores: {0, 1, 2, 3, 4, 5...
[:2]:104 visible cpu cores: {0, 1, 2, 3, 4, 5...
[:3]:104 visible cpu cores: {0, 1, 2, 3, 4, 5...
[:4]:104 visible cpu cores: {52, 53, 54, 55, ...
[:5]:104 visible cpu cores: {52, 53, 54, 55, ...
[:6]:104 visible cpu cores: {52, 53, 54, 55, ...
[:7]:104 visible cpu cores: {52, 53, 54, 55, ...
```

The first 4 accelerators use the first half of the cpu-cores and the other 4 the second half, which matches the earlier explanation of the right setting.

If you remove `./numa-set.sh`, as in:

```
torchrun --nproc_per_node=8 --role : --tee 3 --no-python python -c \
'import os; cores=os.sched_getaffinity(0); print(f"{len(cores)} visible cpu cores: {cores}")'
```
you will see that all 8 processes see all 208 cpu-cores:
```
[:0]:208 visible cpu cores: {0, 1, 2, 3, ...
```

So, since each process has access to any cpu-core, cross-talk may occur, which may introduce a small performance overhead.


#### os.sched_setaffinity

You can, of course, change the NUMA affinity after the program has been launched. You saw the use of `os.sched_getaffinity` to get the current settings, and the corresponding `os.sched_setaffinity` is used to change them.

```
import os
os.sched_setaffinity(0, [0, 1])
```
Here we told the system that the process running this script (`0`) can only use cpu-cores `0` and `1`.
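
The cpu-cores that make up a given NUMA node can also be read directly from Linux sysfs, so if you prefer not to depend on extra libraries, here is a minimal sketch that binds the current process to all cpu-cores of NUMA node 0 (it assumes a Linux system; the node id itself still has to come from somewhere, e.g. the GPU-to-node mapping shown earlier):

```
import os

def numa_node_cpus(node):
    # e.g. /sys/devices/system/node/node0/cpulist -> "0-51,104-155"
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = []
        for chunk in f.read().strip().split(","):
            start, _, end = chunk.partition("-")
            cpus.extend(range(int(start), int(end or start) + 1))
    return cpus

# bind the current process to all cpu-cores of NUMA node 0
os.sched_setaffinity(0, numa_node_cpus(0))
print(f"{len(os.sched_getaffinity(0))} visible cpu cores")
```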

So now we just need to figure out how to programmatically get the right cpu sets for each accelerator's process. Here is how to do it with [pynvml](#pynvml).

#### pynvml

If you're using NVIDIA GPUs, `pynvml` (`pip install pynvml`) can be very helpful to get all sorts of information about the GPU without needing to call `nvidia-smi` - in this situation we are going to use it to tell us the correct affinity given a GPU index.

In [numa-set-pynvml.py](numa/numa-set-pynvml.py) you will find a working helper function that you could call at the very top of your training loop like so:
```
local_rank = torch.distributed.get_rank()
set_numa_affinity(local_rank, verbose=True)
```
Call it before the `DataLoader` is initialized, so that its workers use the right cpu-cores!
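
To see why the ordering matters, here is a minimal standalone sketch (run as a single process, no launcher; a plain `os.sched_setaffinity` call stands in for `set_numa_affinity` to keep it self-contained) showing that `DataLoader` workers inherit whatever affinity the parent process has when they are spawned:

```
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

def report_affinity(worker_id):
    # runs inside each DataLoader worker process
    print(f"worker {worker_id}: {len(os.sched_getaffinity(0))} visible cpu cores")

if __name__ == "__main__":
    os.sched_setaffinity(0, {0, 1})  # stand-in for set_numa_affinity(local_rank)
    ds = TensorDataset(torch.arange(64).float())
    dl = DataLoader(ds, batch_size=8, num_workers=2, worker_init_fn=report_affinity)
    for _ in dl:  # iterating starts the worker processes
        pass      # each worker should report 2 visible cpu cores
```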

Normally, the local process rank equals the gpu index, but if `CUDA_VISIBLE_DEVICES` is used this might no longer be true - in that case you will need to remap the process rank to the actual gpu index:

```
gpu_index = int(os.environ.get("LOCAL_RANK", 0))
if "CUDA_VISIBLE_DEVICES" in os.environ:
    ids = list(map(int, os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")))
    gpu_index = ids[gpu_index] # remap
```

The other gotcha can be `CUDA_DEVICE_ORDER`, which typically defaults to `PCI_BUS_ID`, but which one could also set to `CUDA_DEVICE_ORDER=FASTEST_FIRST` when having mixed GPUs. It's very unlikely that you will run into this in a high-end server setup, so you can safely ignore it.


#### srun

If using SLURM and you're OK with using `srun` as the launcher, rather than `torchrun`, `accelerate`, etc., it'll do all the binding work for you automatically. See the full launcher [here](../../orchestration/slurm/launchers/srun-launcher.slurm).

To make it NUMA affinity-ready all you need to add are these 2 headers:
```
#SBATCH --gres-flags=enforce-binding
#SBATCH --ntasks-per-socket=4
```

`--ntasks-per-socket=4` assumes you have 2 cpu sockets with 8 accelerators - so `8/2=4` accelerators per socket.

This is an even more precise solution, since it'd assign each process its own group of cpu-cores, rather than just give all the NUMA node 0 cpu-cores to the processes driving accelerators 0-3, and NUMA node 1 cpu-cores to the processes driving accelerators 4-7.
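
To illustrate what such a per-process split could look like on the example machine above, here is a minimal sketch - the numbers (4 GPUs per NUMA node, 52 physical cores per node, HT sibling of physical core `p` being `p+104`) are assumptions taken from the earlier example, and on a real system you'd read them from sysfs or hwloc rather than hardcode them:

```
def per_process_cpus(local_rank, gpus_per_numa=4, phys_per_numa=52, phys_total=104):
    # which NUMA node this rank's GPU lives on, and its index within that node
    node, i = divmod(local_rank, gpus_per_numa)
    per_proc = phys_per_numa // gpus_per_numa  # 13 physical cores per process
    first_phys = node * phys_per_numa + i * per_proc
    phys = range(first_phys, first_phys + per_proc)
    ht = range(phys_total + first_phys, phys_total + first_phys + per_proc)  # HT siblings
    return list(phys) + list(ht)

print(per_process_cpus(1))  # rank 1 -> physical 13..25 plus HT siblings 117..129
print(per_process_cpus(4))  # rank 4 -> physical 52..64 plus HT siblings 156..168
```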

#### Specific launchers

Various launchers have support for NUMA affinity settings:

- [HF Accelerate](https://github.com/huggingface/accelerate) has a flag `--enable_cpu_affinity` that you add to the `accelerate` launch command and it'll do this for you. Available since `accelerate>0.28.0`.
- [torchrun](https://github.com/pytorch/pytorch) doesn't have it, but I showed how to do it in this [section](#numactl).
- srun was covered [here](#srun).
training/performance/numa/numa-set-pynvml.py

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
# this helper util will assign the cpu-cores belonging to the same NUMA node as the GPU

# derived from
# https://github.com/NVIDIA/DeepLearningExamples/blob/9dd9fcb98f56187e49c5ee280cf8dbd530dde57b/TensorFlow2/LanguageModeling/BERT/gpu_affinity.py

import os
import math
import pynvml as nvml

nvml.nvmlInit()

def set_numa_affinity(gpu_index, verbose=False):
    """This util will assign to the current process the cpu cores set that resides on the same NUMA
    node as the GPU. Typically if you have 8 GPUs, then the first 4 are on the first NUMA node and
    the remaining 4 are on the second.

    `gpu_index` is typically the same as `LOCAL_RANK` in the distributed training, but beware that
    `CUDA_VISIBLE_DEVICES` could impact that. e.g. `CUDA_VISIBLE_DEVICES=0,7` won't do the right
    thing - then you will probably want to remap the ids with something like:

    ```
    if "CUDA_VISIBLE_DEVICES" in os.environ:
        ids = list(map(int, os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")))
        gpu_index = ids[gpu_index] # remap
    ```

    """

    num_elements = math.ceil(os.cpu_count() / 64)
    handle = nvml.nvmlDeviceGetHandleByIndex(gpu_index)
    affinity_string = ""
    for j in nvml.nvmlDeviceGetCpuAffinity(handle, num_elements):
        # assume nvml returns list of 64 bit ints
        affinity_string = f"{j:064b}{affinity_string}"
    affinity_list = [int(x) for x in affinity_string]
    affinity_list.reverse()  # so core 0 is the 0th element
    affinity_to_set = [i for i, e in enumerate(affinity_list) if e != 0]

    if verbose:
        cores = os.sched_getaffinity(0)
        print(f"before: {len(cores)} visible cpu cores: {cores}")
    os.sched_setaffinity(0, affinity_to_set)
    if verbose:
        cores = os.sched_getaffinity(0)
        print(f"after: {len(cores)} visible cpu cores: {cores}")

if __name__ == "__main__":

    # pretend we are a process that drives gpu 0
    set_numa_affinity(0, verbose=True)

training/performance/numa/numa-set.sh

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
#!/usr/bin/bash

# this helper util performs NUMA node binding which can be used with torchrun, and other launchers
# contributed by https://github.com/yifuwang

# 1. first make it executable:
#
# chmod a+x ./numa-set.sh
#
# 2. launch torchrun and test that it assigns the cores correctly
#
# torchrun --nproc_per_node=8 --no-python ./numa-set.sh \
#     python -c 'import os; cs=os.sched_getaffinity(0); print(f"{len(cs)} visible cpu cores: {cs}")'
#
# so if your original torchrun launcher looked like:
#
# torchrun --nproc_per_node=8 --nnodes 2 ... train.py
#
# now it'll become:
#
# torchrun --nproc_per_node=8 --nnodes 2 ... --no-python ./numa-set.sh python train.py

# Query the bus ID for device LOCAL_RANK
BUS_ID=$(nvidia-smi --query-gpu=pci.bus_id -i $LOCAL_RANK --format=csv,noheader)
BUS_ID=${BUS_ID,,}

# Find the numa node for device LOCAL_RANK
NODE=$(cat /sys/bus/pci/devices/${BUS_ID:4}/numa_node)

echo "Starting local rank $RANK on NUMA node $NODE"
numactl --cpunodebind=$NODE --membind=$NODE "$@"

0 commit comments
