`network/README.md`: 0 additions, 48 deletions (48 lines changed)
@@ -615,54 +615,6 @@ As I have shown in these sections it should be possible to be able to do a back-
## NUMA Affinity
[Non-uniform memory access (NUMA)](https://en.wikipedia.org/wiki/Non-uniform_memory_access) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor.
As modern servers have more than one CPU, to get the best performance, processes driving GPUs that reside in the same NUMA node as a given CPU should be bound to that NUMA node.
Here is a typical 8x A100 GPU server, as visualized by [hwloc](https://github.com/open-mpi/hwloc):

As you can see it has 2 CPUs, each defining a NUMA block, and each such block contains a group of 4 GPUs. The GPUs are the grey blocks that say `CoProc` with 108 compute units (SMs) and 79GB of memory.
footnote: the diagram was generated by `lstopo a100.png`
If you're using Hyper-Threads then you want to use `lstopo -l` to see the HT core count reported correctly. For example, if you have 2 NUMA nodes with 8 accelerators, 104 physical cpu-cores and 208 logical cores - thus `208/8=26` HT-cores per GPU - then the HT cores for GPU0 will be `[0, 1, 2, 3, 4, ..., 25, 104, 105, 106, 107, 108, ..., 129]`: first the physical cpu-cores and then the remaining HT cores, hence the gap.
#### Software Tools
note-to-self: probably belongs in its own chapter?
##### hwloc
https://github.com/open-mpi/hwloc
The Hardware Locality (hwloc) software project aims at easing the process of discovering hardware resources in parallel architectures. It offers command-line tools and a C API for consulting these resources, their locality, attributes, and interconnection. hwloc primarily aims at helping high-performance computing (HPC) applications, but is also applicable to any project seeking to exploit code and/or data locality on modern computing platforms.
Diagnostics: to take a snapshot of the server NUMA topology and save it as an image (many other formats are supported):

```
lstopo a100.png
```
NUMA node binding: `hwloc-bind` - binding processes, threads and memory
Bind an existing process to a specific NUMA node:
```
hwloc-bind --pid 1234 numa:0
```
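You can also launch a new process already bound to a given NUMA node. A minimal sketch (the script name is just a placeholder):

```
hwloc-bind numa:0 -- python train.py
```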
Similar software: `numactl`/`libnuma`
Some useful suggestions can be found in the [pytorch docs](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#utilize-non-uniform-memory-access-numa-controls).
`training/performance/README.md`: 140 additions, 0 deletions (140 lines changed)
@@ -479,3 +479,143 @@ The full recommendations are:
3. `b*s`, `h/a`, and `h/t` should be divisible by a power of 2
4. `(b*a)/t` should be an integer
5. `t` should be as small as possible
## NUMA affinity
[Non-uniform memory access (NUMA)](https://en.wikipedia.org/wiki/Non-uniform_memory_access) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor.
As modern servers have more than one CPU, to get the best performance, processes driving accelerators that reside in the same NUMA node as a given CPU should be bound to that same NUMA node.
First, let's understand what NUMA nodes signify.
Here is the NUMA node diagram of a typical 8x A100 GPU server:

As you can see it has 2 CPUs, each defining a NUMA block, and each such block contains a group of 4 GPUs. The GPUs are the grey blocks that say `CoProc` with 108 compute units (SMs) and 79GB of memory.
footnote: the diagram was generated by `lstopo a100.png` from [hwloc](https://github.com/open-mpi/hwloc).
If you're using Hyper-Threads then you want to use `lstopo -l` to see the HT core count presented correctly. For example, if you have 2 NUMA nodes with 8 accelerators, 104 physical cpu-cores and 208 logical cores - thus `208/8=26` HT-cores per GPU - then the HT cores for GPU0 will be `[0, 1, 2, 3, 4, ..., 25, 104, 105, 106, 107, 108, ..., 129]`.
You first get the physical cpu-cores and then the remaining HT cores, hence the strange gap.
Now that it's clear that the various compute components are placed in 2 or more groups, to achieve the best performance we need to ensure that the components communicate within the group they belong to, and avoid any cross-talk. For example, if gpu0 belongs to NUMA node 0, then the process that drives this GPU should only use cpu-cores from NUMA node 0.
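If you want to quickly check which cpu-cores belong to which NUMA node, here is a small sketch that reads the Linux sysfs interface (assumes a Linux host):

```
# list the cpu-cores of each NUMA node via Linux sysfs
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpulist = (node / "cpulist").read_text().strip()
    print(f"{node.name}: cpu-cores {cpulist}")
```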
The same should apply to networking or any other components that you may have control over.
Practically though, in my experience so far, if your workload is very light on CPU work this change will make very little difference to the overall performance, but it can be quite impactful if a lot of CPU work is done. On the other hand, if doing the most efficient thing is easy, even the tiniest improvement is likely to accumulate over long training jobs, so it's worth implementing, IMHO.
### NUMA process binding
There are multiple ways to accomplish the binding of processes to the cpu-cores of the right NUMA node.
#### numactl
One of the most common tools to do that is `numactl`, which sets the NUMA affinity as it launches a new process.
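For example, here is a sketch of binding a single standalone process to NUMA node 0's cpu-cores and memory (the program name is just a placeholder):

```
numactl --cpunodebind=0 --membind=0 python train.py
```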
For example, let's see how it can be integrated with the `torchrun` launcher.
This launcher currently needs a helper util [numa-set.sh](numa/numa-set.sh) to perform the NUMA affinity settings. Once you have downloaded it and made it executable, you can now get the right NUMA affinity using:

```
torchrun --nproc_per_node=8 --role : --tee 3 --no-python ./numa-set.sh python -c \
'import os; cores=os.sched_getaffinity(0); print(f"{len(cores)} visible cpu cores: {cores}")'
```
On a system with 208 HT cpu-cores, you will most likely see:
```
[:0]:104 visible cpu cores: {0, 1, 2, 3, 4, 5...
[:1]:104 visible cpu cores: {0, 1, 2, 3, 4, 5...
[:2]:104 visible cpu cores: {0, 1, 2, 3, 4, 5...
[:3]:104 visible cpu cores: {0, 1, 2, 3, 4, 5...
[:4]:104 visible cpu cores: {52, 53, 54, 55, ...
[:5]:104 visible cpu cores: {52, 53, 54, 55, ...
[:6]:104 visible cpu cores: {52, 53, 54, 55, ...
[:7]:104 visible cpu cores: {52, 53, 54, 55, ...
```
The first 4 accelerators use the first half of the cpu-cores and the other 4 the second half, which matches the earlier explanations of the right setting.

If you remove `./numa-set.sh` from the launch command:

```
torchrun --nproc_per_node=8 --role : --tee 3 --no-python python -c \
'import os; cores=os.sched_getaffinity(0); print(f"{len(cores)} visible cpu cores: {cores}")'
```
You will see that all 8 processes see all 208 cpu-cores:
```
[:0]:208 visible cpu cores: {0, 1, 2, 3, ...
```
So, as each process has access to any cpu-core, cross-talk may occur, which may introduce a small performance overhead.
#### os.sched_setaffinity
You can, of course, change the NUMA affinity after the program has been launched. You saw the use of `os.sched_getaffinity` to get the current settings, and the corresponding `os.sched_setaffinity` is used to change them.
```
import os

os.sched_setaffinity(0, [0, 1])
```
Here we told the system that the process running this script (`0`) can only use cpu-cores `0` and `1`.
So now we just need to figure out how to programmatically get the right cpu sets for each accelerator's process. Here is how to do it with [pynvml](#pynvml).
#### pynvml
If you're using NVIDIA GPUs, `pynvml` (`pip install pynvml`) can be very helpful for getting all sorts of information about the gpu without needing to call `nvidia-smi` - in this situation we are going to use it to tell us the correct affinity given a GPU index.
In [numa-set-pynvml.py](numa/numa-set-pynvml.py) you will find a working helper function that you could call at the very top of your training loop like so:
```
local_rank = torch.distributed.get_rank()
set_numa_affinity(local_rank, verbose=True)
```
Call it before the `DataLoader` is initialized, so that its workers use the right cpu-cores as well!
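If you'd rather see what such a helper could do internally, here is a rough sketch, assuming `pynvml` exposes `nvmlDeviceGetCpuAffinity` - this is not the actual [numa-set-pynvml.py](numa/numa-set-pynvml.py) code:

```
# a minimal sketch: ask NVML which cpu-cores are local to the given GPU and
# bind the current process to them
import os
import pynvml

def set_numa_affinity(gpu_index, verbose=False):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    # the affinity comes back as a bitmask packed into 64-bit words
    num_words = (os.cpu_count() + 63) // 64
    affinity_mask = pynvml.nvmlDeviceGetCpuAffinity(handle, num_words)
    cores = [word_idx * 64 + bit
             for word_idx, word in enumerate(affinity_mask)
             for bit in range(64) if word & (1 << bit)]
    if verbose:
        print(f"GPU {gpu_index}: binding process to cpu-cores {cores}")
    os.sched_setaffinity(0, cores)
```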
Normally, the local process rank equals the gpu index, but if one uses `CUDA_VISIBLE_DEVICES` - this might not be true any longer - if you use it, you will need to remap the process rank to the actual GPU index, for example as in the sketch below.
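Here is one possible way to do that remapping - a hypothetical sketch, where `LOCAL_RANK` is the environment variable set by `torchrun` and `set_numa_affinity` is the helper discussed above:

```
import os

local_rank = int(os.environ["LOCAL_RANK"])
visible = os.environ.get("CUDA_VISIBLE_DEVICES")
if visible is not None:
    # e.g. CUDA_VISIBLE_DEVICES=4,5,6,7 maps local rank 0 to physical gpu 4
    gpu_index = int(visible.split(",")[local_rank])
else:
    gpu_index = local_rank
set_numa_affinity(gpu_index, verbose=True)
```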
The other gotcha can be `CUDA_DEVICE_ORDER`, which typically defaults to `PCI_BUS_ID`, but one could also set it to `CUDA_DEVICE_ORDER=FASTEST_FIRST` if you have mixed GPUs; it's very unlikely that you will run into this in a high-end server setup, so you can safely ignore it.
#### srun
If using SLURM and you're OK with using `srun` as the launcher, rather than `torchrun`, `accelerate`, etc., it'll do all the binding work for you automatically. See the full launcher [here](../../orchestration/slurm/launchers/srun-launcher.slurm).
To make it NUMA affinity-ready, all you need to add are these 2 headers:
```
#SBATCH --gres-flags=enforce-binding
#SBATCH --ntasks-per-socket=4
```
`--ntasks-per-socket=4` assumes you have 2 cpu sockets with 8 accelerators - so `8/2=4` accelerators per socket.
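To put it in context, a minimal preamble for a hypothetical 1-node job with 8 accelerators and 2 cpu sockets might look like this (only the last 2 directives are the NUMA-specific additions):

```
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --gres-flags=enforce-binding
#SBATCH --ntasks-per-socket=4
```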
This is an even more precise solution, since it'd assign each process its own group of cpu-cores, rather than just give all the NUMA node 0 cpu-cores to the processes driving accelerators 0-3, and NUMA node 1 cpu-cores to the processes driving accelerators 4-7.
#### Specific launchers
Various launchers have support for NUMA affinity settings:
- [HF Accelerate](https://github.com/huggingface/accelerate) has a flag `--enable_cpu_affinity` that you add to the `accelerate` launch command and it'll do this for you. Available since `accelerate>0.28.0`.
- [torchrun](https://github.com/pytorch/pytorch) doesn't have it, but I showed how to do it in this [section](#numactl).