Commit 008c4f7

expand on numa
1 parent 105a9c4 commit 008c4f7

File tree

5 files changed: +222, -48 lines changed

network/README.md

Lines changed: 0 additions & 48 deletions
@@ -615,54 +615,6 @@ As I have shown in these sections it should be possible to be able to do a back-

## NUMA Affinity

[Non-uniform memory access (NUMA)](https://en.wikipedia.org/wiki/Non-uniform_memory_access) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor.
As modern servers have more than one CPU, to get the best performance, GPUs residing in the same block as the corresponding CPU should have their processes bound to that NUMA node.

Here is a typical 8x A100 GPU server, as visualized by [hwloc](https://github.com/open-mpi/hwloc):

![a100 server numa nodes](images/a100-server-hwloc.png)

As you can see it has 2 CPUs, each defining a NUMA block, and each such block contains a group of 4 GPUs. The GPUs are the grey blocks that say `CoProc`, with 108 compute units (SMs) and 79GB of memory.

footnote: the diagram was generated by `lstopo a100.png`

If you're using Hyper-Threads then you want to use `lstopo -l` to see the HT core count correctly. For example, if you have 2 NUMA nodes with 8 accelerators, 104 physical cpu-cores and 208 logical cores (thus `208/8=26` HT-cores per GPU), then the HT cores for GPU0 will be `[0, 1, 2, 3, 4, ..., 25, 104, 105, 106, 107, 108, ..., 129]` - first the physical cpu-cores and then the remaining HT cores, hence the gap.

#### Software Tools

note-to-self: probably belongs in its own chapter?

##### hwloc

https://github.com/open-mpi/hwloc

The Hardware Locality (hwloc) software project aims at easing the process of discovering hardware resources in parallel architectures. It offers command-line tools and a C API for consulting these resources, their locality, attributes, and interconnection. hwloc primarily aims at helping high-performance computing (HPC) applications, but is also applicable to any project seeking to exploit code and/or data locality on modern computing platforms.

Diagnostics: to take a snapshot of the server NUMA topology and save it as an image (it supports many other formats):
```
lstopo a100.png
```

NUMA node binding: `hwloc-bind` - binding processes, threads and memory

Bind an existing process to a specific NUMA node:
```
hwloc-bind --pid 1234 numa:0
```

Similar software: `numactl`/`libnuma`

Some useful suggestions can be found in the [pytorch docs](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#utilize-non-uniform-memory-access-numa-controls)

## Important nuances

### Real network throughput

training/performance/README.md

Lines changed: 140 additions & 0 deletions
@@ -479,3 +479,143 @@ The full recommendations are:
3. `b*s`, `h/a`, and `h/t` should be divisible by a power of 2
4. `(b*a)/t` should be an integer
5. `t` should be as small as possible
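
For a quick sanity check of a planned config against these rules, here is a minimal sketch - it assumes `b`, `s`, `h`, `a` and `t` denote the micro-batch size, sequence length, hidden size, number of attention heads and tensor-parallel degree as defined earlier in this document, and it picks 64 as the power of 2 purely for illustration:

```
def check_recommendations(b, s, h, a, t, pow2=64):
    # pow2=64 is an arbitrary choice of "a power of 2" for illustration
    return {
        "b*s divisible by a power of 2": (b * s) % pow2 == 0,
        "h/a divisible by a power of 2": (h // a) % pow2 == 0,
        "h/t divisible by a power of 2": (h // t) % pow2 == 0,
        "(b*a)/t is an integer":         (b * a) % t == 0,
    }

# e.g. a hypothetical config: micro-batch 4, seqlen 2048, hidden 8192, 64 heads, TP=8
print(check_recommendations(b=4, s=2048, h=8192, a=64, t=8))
```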

## NUMA affinity

[Non-uniform memory access (NUMA)](https://en.wikipedia.org/wiki/Non-uniform_memory_access) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor.
As modern servers have more than one CPU, to get the best performance the processes driving the accelerators should be bound to the same NUMA node as the CPU those accelerators are attached to.

First, let's understand what NUMA nodes signify.

Here is the NUMA node diagram of a typical 8x A100 GPU server:

![a100 server numa nodes](images/a100-server-hwloc.png)

As you can see it has 2 CPUs, each defining a NUMA block, and each such block contains a group of 4 GPUs. The GPUs are the grey blocks that say `CoProc`, with 108 compute units (SMs) and 79GB of memory.

footnote: the diagram was generated by `lstopo a100.png` from [hwloc](https://github.com/open-mpi/hwloc).

If you're using Hyper-Threads then you want to use `lstopo -l` to see the HT core count presented correctly. For example, if you have 2 NUMA nodes with 8 accelerators, 104 physical cpu-cores and 208 logical cores (thus `208/8=26` HT-cores per GPU), then the HT cores will be:

- gpu0..3: `[0, 1, 2, 3, ..., 51, 104, 105, 106, ..., 155]`
- gpu4..7: `[52, 53, 54, ..., 103, 156, 157, 158, ..., 207]`

The physical cpu-cores are listed first, followed by the remaining HT cores, hence the strange gap.
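
If it helps to see where those ranges come from, here is a minimal sketch that computes the cpu-core list of each NUMA node for this particular layout - the numbers (2 NUMA nodes, 104 physical cores, 208 logical cores) are assumptions matching the example above:

```
# compute the logical cpu-core ids belonging to each NUMA node, assuming physical cores are
# split evenly across nodes and HT siblings are numbered after all the physical cores
num_nodes, phys_cores = 2, 104
per_node = phys_cores // num_nodes  # 52 physical cores per node

for node in range(num_nodes):
    phys = list(range(node * per_node, (node + 1) * per_node))  # e.g. 0..51
    ht = list(range(phys_cores + node * per_node, phys_cores + (node + 1) * per_node))  # e.g. 104..155
    print(f"node{node}: {phys[0]}..{phys[-1]} + {ht[0]}..{ht[-1]} ({len(phys) + len(ht)} cores)")
```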

Now that it's clear that the various compute components are placed in 2 or more groups, to achieve the best performance we need to ensure that the components communicate within the group they belong to and avoid any cross-talk. For example, if gpu0 belongs to NUMA node 0, then the process that drives this GPU should only use cpu-cores from NUMA node 0.

The same should apply to networking or any other components that you may have control over.
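
To find out which NUMA node each GPU belongs to without reading it off the diagram, here is a minimal sketch that queries Linux sysfs, using `pynvml` (introduced [below](#pynvml)) only to get each GPU's PCI bus id - it assumes a Linux box with the NVIDIA driver loaded, and mirrors what the `numa-set.sh` helper shown later does in bash:

```
import pynvml as nvml

nvml.nvmlInit()
for i in range(nvml.nvmlDeviceGetCount()):
    handle = nvml.nvmlDeviceGetHandleByIndex(i)
    bus_id = nvml.nvmlDeviceGetPciInfo(handle).busId
    if isinstance(bus_id, bytes):  # older pynvml versions return bytes
        bus_id = bus_id.decode()
    # sysfs uses a 4-hex-digit PCI domain, nvml reports 8 digits - drop the extra 4
    with open(f"/sys/bus/pci/devices/{bus_id[4:].lower()}/numa_node") as f:
        print(f"gpu{i}: NUMA node {f.read().strip()}")
```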

Practically though, in my experience so far, if your workload is very light on CPU work this change will make very little difference to the overall performance, but it can be quite impactful if a lot of CPU work is done. On the other hand, since doing the most efficient thing is easy, even the tiniest improvement is likely to accumulate over long training jobs, so it's worth implementing, IMHO.

### NUMA process binding

There are multiple ways to accomplish the binding of processes to the cpu-cores of the right NUMA node.

#### numactl

One of the most common tools to do that is `numactl`, which sets the NUMA affinity as it launches a new process.

For example, let's see how it can be integrated with the `torchrun` launcher.

This launcher currently needs a helper util [numa-set.sh](numa/numa-set.sh) to perform the NUMA affinity setting. Once you have downloaded it and made it executable, you can get the right NUMA affinity using:

```
torchrun --nproc_per_node=8 --role : --tee 3 --no-python ./numa-set.sh python your-program.py
```

Note: you'd need `numactl` installed on your system for this util to work.

For example, here is how you can validate that the assignments are correct:
```
torchrun --nproc_per_node=8 --role : --tee 3 --no-python ./numa-set.sh python -c \
'import os; cores=os.sched_getaffinity(0); print(f"{len(cores)} visible cpu cores: {cores}")'
```

On a system with 208 HT cpu-cores, you will most likely see:

```
[:0]:104 visible cpu cores: {0, 1, 2, 3, 4, 5...
[:1]:104 visible cpu cores: {0, 1, 2, 3, 4, 5...
[:2]:104 visible cpu cores: {0, 1, 2, 3, 4, 5...
[:3]:104 visible cpu cores: {0, 1, 2, 3, 4, 5...
[:4]:104 visible cpu cores: {52, 53, 54, 55, ...
[:5]:104 visible cpu cores: {52, 53, 54, 55, ...
[:6]:104 visible cpu cores: {52, 53, 54, 55, ...
[:7]:104 visible cpu cores: {52, 53, 54, 55, ...
```

The first 4 accelerators use the first half of the cpu-cores and the other 4 the second half, which matches the earlier explanation of the right setting.

If you remove `./numa-set.sh`, as in:

```
torchrun --nproc_per_node=8 --role : --tee 3 --no-python python -c \
'import os; cores=os.sched_getaffinity(0); print(f"{len(cores)} visible cpu cores: {cores}")'
```
you will see that all 8 processes see all 208 cpu-cores:
```
[:0]:208 visible cpu cores: {0, 1, 2, 3, ...
```

So, since each process has access to any cpu-core, cross-talk may occur, which may introduce a small performance overhead.


#### os.sched_setaffinity

You can, of course, change the NUMA affinity after the program has been launched. You saw the use of `os.sched_getaffinity` to get the current settings, and the corresponding `os.sched_setaffinity` is used to change them.

```
import os
os.sched_setaffinity(0, [0, 1])
```
Here we told the system that the process running this script (`0`) can only use cpu-cores `0` and `1`.
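
The cpu-cores that make up a given NUMA node can also be read directly from Linux sysfs, so if you prefer not to depend on extra libraries, here is a minimal sketch that binds the current process to all cpu-cores of NUMA node 0 (it assumes a Linux system; the node id itself still has to come from somewhere, e.g. the GPU-to-node mapping shown earlier):

```
import os

def numa_node_cpus(node):
    # e.g. /sys/devices/system/node/node0/cpulist -> "0-51,104-155"
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = []
        for chunk in f.read().strip().split(","):
            start, _, end = chunk.partition("-")
            cpus.extend(range(int(start), int(end or start) + 1))
    return cpus

# bind the current process to all cpu-cores of NUMA node 0
os.sched_setaffinity(0, numa_node_cpus(0))
print(f"{len(os.sched_getaffinity(0))} visible cpu cores")
```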

So now we just need to figure out how to programmatically get the right cpu sets for each accelerator's process. Here is how to do it with [pynvml](#pynvml).

#### pynvml

If you're using NVIDIA GPUs, `pynvml` (`pip install pynvml`) can be very helpful to get all sorts of information about the GPU without needing to call `nvidia-smi` - in this situation we are going to use it to tell us the correct affinity given a GPU index.

In [numa-set-pynvml.py](numa/numa-set-pynvml.py) you will find a working helper function that you could call at the very top of your training loop like so:
```
local_rank = torch.distributed.get_rank()
set_numa_affinity(local_rank, verbose=True)
```
Call it before the `DataLoader` is initialized, so that its workers use the right cpu-cores!
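
To see why the ordering matters, here is a minimal standalone sketch (run as a single process, no launcher; a plain `os.sched_setaffinity` call stands in for `set_numa_affinity` to keep it self-contained) showing that `DataLoader` workers inherit whatever affinity the parent process has when they are spawned:

```
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

def report_affinity(worker_id):
    # runs inside each DataLoader worker process
    print(f"worker {worker_id}: {len(os.sched_getaffinity(0))} visible cpu cores")

if __name__ == "__main__":
    os.sched_setaffinity(0, {0, 1})  # stand-in for set_numa_affinity(local_rank)
    ds = TensorDataset(torch.arange(64).float())
    dl = DataLoader(ds, batch_size=8, num_workers=2, worker_init_fn=report_affinity)
    for _ in dl:  # iterating starts the worker processes
        pass      # each worker should report 2 visible cpu cores
```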

Normally, the local process rank equals the gpu index, but if `CUDA_VISIBLE_DEVICES` is used this might no longer be true - in that case you will need to remap the process rank to the actual gpu index:

```
gpu_index = int(os.environ.get("LOCAL_RANK", 0))
if "CUDA_VISIBLE_DEVICES" in os.environ:
    ids = list(map(int, os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")))
    gpu_index = ids[gpu_index] # remap
```

The other gotcha can be `CUDA_DEVICE_ORDER`, which typically defaults to `PCI_BUS_ID`, but which one could also set to `CUDA_DEVICE_ORDER=FASTEST_FIRST` when having mixed GPUs. It's very unlikely that you will run into this in a high-end server setup, so you can safely ignore it.


#### srun

If using SLURM and you're OK with using `srun` as the launcher, rather than `torchrun`, `accelerate`, etc., it'll do all the binding work for you automatically. See the full launcher [here](../../orchestration/slurm/launchers/srun-launcher.slurm).

To make it NUMA affinity-ready all you need to add are these 2 headers:
```
#SBATCH --gres-flags=enforce-binding
#SBATCH --ntasks-per-socket=4
```

`--ntasks-per-socket=4` assumes you have 2 cpu sockets with 8 accelerators - so `8/2=4` accelerators per socket.

This is an even more precise solution, since it'd assign each process its own group of cpu-cores, rather than just give all the NUMA node 0 cpu-cores to the processes driving accelerators 0-3, and NUMA node 1 cpu-cores to the processes driving accelerators 4-7.
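
To illustrate what such a per-process split could look like on the example machine above, here is a minimal sketch - the numbers (4 GPUs per NUMA node, 52 physical cores per node, HT sibling of physical core `p` being `p+104`) are assumptions taken from the earlier example, and on a real system you'd read them from sysfs or hwloc rather than hardcode them:

```
def per_process_cpus(local_rank, gpus_per_numa=4, phys_per_numa=52, phys_total=104):
    # which NUMA node this rank's GPU lives on, and its index within that node
    node, i = divmod(local_rank, gpus_per_numa)
    per_proc = phys_per_numa // gpus_per_numa  # 13 physical cores per process
    first_phys = node * phys_per_numa + i * per_proc
    phys = range(first_phys, first_phys + per_proc)
    ht = range(phys_total + first_phys, phys_total + first_phys + per_proc)  # HT siblings
    return list(phys) + list(ht)

print(per_process_cpus(1))  # rank 1 -> physical 13..25 plus HT siblings 117..129
print(per_process_cpus(4))  # rank 4 -> physical 52..64 plus HT siblings 156..168
```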

#### Specific launchers

Various launchers have support for NUMA affinity settings:

- [HF Accelerate](https://github.com/huggingface/accelerate) has a flag `--enable_cpu_affinity` that you add to the `accelerate` launch command and it'll do this for you. Available since `accelerate>0.28.0`.
- [torchrun](https://github.com/pytorch/pytorch) doesn't have it, but I showed how to do it in this [section](#numactl).
- srun was covered [here](#srun).
training/performance/numa/numa-set-pynvml.py

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
# this helper util will assign the cpu-cores belonging to the same NUMA node as the GPU

# derived from
# https://github.com/NVIDIA/DeepLearningExamples/blob/9dd9fcb98f56187e49c5ee280cf8dbd530dde57b/TensorFlow2/LanguageModeling/BERT/gpu_affinity.py

import os
import math
import pynvml as nvml

nvml.nvmlInit()

def set_numa_affinity(gpu_index, verbose=False):
    """This util will assign to the current process the cpu cores set that resides on the same NUMA
    node as the GPU. Typically if you have 8 GPUs, then the first 4 are on the first NUMA node and
    the remaining 4 are on the second.

    `gpu_index` is typically the same as `LOCAL_RANK` in the distributed training, but beware that
    `CUDA_VISIBLE_DEVICES` could impact that. e.g. `CUDA_VISIBLE_DEVICES=0,7` won't do the right
    thing - then you will probably want to remap the ids with something like:

    ```
    if "CUDA_VISIBLE_DEVICES" in os.environ:
        ids = list(map(int, os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")))
        gpu_index = ids[gpu_index] # remap
    ```

    """

    num_elements = math.ceil(os.cpu_count() / 64)
    handle = nvml.nvmlDeviceGetHandleByIndex(gpu_index)
    affinity_string = ""
    for j in nvml.nvmlDeviceGetCpuAffinity(handle, num_elements):
        # assume nvml returns list of 64 bit ints
        affinity_string = f"{j:064b}{affinity_string}"
    affinity_list = [int(x) for x in affinity_string]
    affinity_list.reverse()  # so core 0 is the 0th element
    affinity_to_set = [i for i, e in enumerate(affinity_list) if e != 0]

    if verbose:
        cores = os.sched_getaffinity(0)
        print(f"before: {len(cores)} visible cpu cores: {cores}")
    os.sched_setaffinity(0, affinity_to_set)
    if verbose:
        cores = os.sched_getaffinity(0)
        print(f"after: {len(cores)} visible cpu cores: {cores}")

if __name__ == "__main__":

    # pretend we are a process that drives gpu 0
    set_numa_affinity(0, verbose=True)

training/performance/numa/numa-set.sh

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
#!/usr/bin/bash

# this helper util performs NUMA node binding which can be used with torchrun, and other launchers
# contributed by https://github.com/yifuwang

# 1. first make it executable:
#
# chmod a+x ./numa-set.sh
#
# 2. launch torchrun and test that it assigns the cores correctly
#
# torchrun --nproc_per_node=8 --no-python ./numa-set.sh \
#     python -c 'import os; cs=os.sched_getaffinity(0); print(f"{len(cs)} visible cpu cores: {cs}")'
#
# so if your original torchrun launcher looked like:
#
# torchrun --nproc_per_node=8 --nnodes 2 ... train.py
#
# now it'll become:
#
# torchrun --nproc_per_node=8 --nnodes 2 ... --no-python ./numa-set.sh python train.py

# Query the bus ID for device LOCAL_RANK
BUS_ID=$(nvidia-smi --query-gpu=pci.bus_id -i $LOCAL_RANK --format=csv,noheader)
BUS_ID=${BUS_ID,,}

# Find the numa node for device LOCAL_RANK
NODE=$(cat /sys/bus/pci/devices/${BUS_ID:4}/numa_node)

echo "Starting local rank $RANK on NUMA node $NODE"
numactl --cpunodebind=$NODE --membind=$NODE "$@"

0 commit comments
