
Commit b61057d

update
1 parent aebda3d commit b61057d

2 files changed: +26 -15 lines changed

compute/cpu/README.md

+25 -14
@@ -1,37 +1,48 @@
 # CPU
 
-XXX: This chapter needs a lot more work
+As of this writing machine learning workloads don't use much CPU, so there isn't much to cover in this chapter. As CPUs evolve to become more like GPUs this is likely to change, so I expect this chapter to evolve along with the evolution of CPUs.
 
 ## How many cpu cores do you need
 
-Per 1 gpu you need:
+Per 1 accelerator you need:
 
-1. 1 cpu core per process that is tied to the gpu
-2. 1 cpu core for each DataLoader worker process - and you need 2-4 workers.
+1. 1 cpu core per process that is tied to the accelerator
+2. 1 cpu core for each `DataLoader` worker process - and typically you need 2-4 workers.
 
-2 workers is usually plenty for NLP, especially if the data is preprocessed
+2 workers is usually plenty for LMs, especially if the data is already preprocessed.
 
-If you need to do dynamic transforms, which is often the case with computer vision models, you may need 3-4 and sometimes more workers.
+If you need to do dynamic transforms, which is often the case with computer vision models or VLMs, you may need 3-4 and sometimes more workers.
 
-The goal is to be able to pull from the DataLoader instantly, and not block the GPU's compute, which means that you need to pre-process a bunch of samples for the next iteration, while the current iteration is running. In other words your next batch needs to take no longer than a single iteration GPU compute of the batch of the same size.
+The goal is to be able to pull from the `DataLoader` instantly and not block the accelerator's compute, which means that you need to pre-process a bunch of samples for the next iteration while the current iteration is running. In other words, preparing the next batch needs to take no longer than a single iteration of accelerator compute on a batch of the same size.
 
-Besides preprocessing if you're pulling dynamically from the cloud instead of local storage you also need to make sure that the data is pre-fetched fast enough to feed the workers that feed the gpu furnace.
+Besides preprocessing, if you're pulling dynamically from the cloud instead of local storage, you also need to make sure that the data is pre-fetched fast enough to feed the workers that feed the accelerator furnace.
 
-Multiply that by the number of GPUs, add a few cores for the Operation system (let's say 4).
+Multiply that by the number of accelerators, then add a few cores for the operating system (let's say 4).
 
-If the node has 8 gpus, and you have n_workers, then you need `8*(num_workers+1)+4`. If you're doing NLP, it'd be usually about 2 workers per gpu, so `8*(2+1)+4` => 28 cpu cores. If you do CV training, and, say, you need 4 workers per gpu, then it'd be `8(4+1)+4` => 44 cpu cores.
+If the node has 8 accelerators and you use `num_workers` `DataLoader` workers per accelerator, then you need `8*(num_workers+1)+4` cpu cores. If you're doing NLP, it'd usually be about 2 workers per accelerator, so `8*(2+1)+4` => 28 cpu cores. If you do CV training and, say, you need 4 workers per accelerator, then it'd be `8*(4+1)+4` => 44 cpu cores.
 
 What happens if you have more very active processes than the total number of cpu cores? Some processes will get preempted (put in the queue for when cpu cores become available) and you absolutely want to avoid any context switching.
 
-But modern cloud offerings typically have 48+ cpu-cores so usually there is no problem to have enough cores to go around.
+But modern cloud offerings typically have 50-100+ cpu-cores, so usually there is no problem having enough cores to go around.
+
+See also [Asynchronous DataLoader](../../training/performance#asynchronous-dataloader).
+
+
 
 ### CPU offload
 
-Some frameworks, like [Deepspeed](https://www.deepspeed.ai/tutorials/zero-offload/) can offload some compute work to CPU without creating an bottleneck. In which case you'd want additional cpu-cores.
+Some frameworks, like [Deepspeed](https://www.deepspeed.ai/tutorials/zero-offload/), can offload some compute work to the CPU without creating a bottleneck, in which case you'd want additional cpu-cores.
+
+
+
+## NUMA affinity
+
+See [NUMA affinity](../../training/performance#numa-affinity).
+
 
 
 ## Hyperthreads
 
-Doubles the cpu cores number
+[Hyper-Threads](https://en.wikipedia.org/wiki/Hyper-threading) double the number of cpu cores by virtualizing each physical cpu core into 2 virtual ones, allowing 2 threads to use the same cpu core at the same time. Depending on the type of workload this feature may or may not increase the overall performance. Intel, the inventor of this technology, suggests a possible 30% performance increase in some situations.
 
-XXX:
+See also [To enable Hyper-Threads or not](../../orchestration/slurm/performance.md#to-enable-hyper-threads-or-not).
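To make the worker/core-count arithmetic in the diff above concrete, here is a minimal Python sketch, assuming PyTorch; the dataset, batch size and prefetch settings are illustrative placeholders, not part of this commit:

```python
import os

import torch
from torch.utils.data import DataLoader, TensorDataset

num_accelerators = 8  # accelerators (gpus) per node
num_workers = 2       # DataLoader workers per accelerator: ~2 for NLP, 3-4+ for CV/VLM
os_cores = 4          # a few cores reserved for the operating system

# the `8*(num_workers+1)+4` rule of thumb from the text above
cores_needed = num_accelerators * (num_workers + 1) + os_cores
print(f"cpu cores needed: {cores_needed}, logical cores available: {os.cpu_count()}")

# hypothetical stand-in dataset - in real training this is your (pre)processed data
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

# one such DataLoader runs in each accelerator-bound process; its workers
# prepare the next batches in the background so the accelerator never waits
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=num_workers,  # each worker occupies roughly one cpu core
    pin_memory=True,          # faster host-to-accelerator copies
    prefetch_factor=2,        # batches each worker keeps ready ahead of time
)

for x, y in loader:
    pass  # the training step would go here
```

With `num_workers=2` per accelerator this reproduces the `8*(2+1)+4` => 28 cpu cores example from the diff.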

orchestration/slurm/users.md

+1 -1
@@ -106,7 +106,7 @@ srun --pty --partition=dev --nodes=1 --ntasks=1 --cpus-per-task=96 --gres=gpu:8
 
 ## Hyper-Threads
 
-By default, if the cpu has hyper-threads (HT), SLURM will use it. If you don't want to use HT you have to specify `--hint=nomultithread`.
+By default, if the cpu has [Hyper-Threads](https://en.wikipedia.org/wiki/Hyper-threading) (HT) enabled, SLURM will use it. If you don't want to use HT you have to specify `--hint=nomultithread`.
 
 footnote: HT is Intel-specific naming, the general concept is simultaneous multithreading (SMT)
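For illustration, building on the `srun` line shown in the hunk context above: appending `--hint=nomultithread` to that command restricts the allocation to one thread per physical core. Since the virtual HT siblings are then no longer schedulable, a request like `--cpus-per-task=96` would typically need to be reduced to the physical-core count (e.g. 48, assuming 2 hardware threads per core - an assumption, not a value from this commit).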
