Here are my comments for the review of episode 6:
- Is the rendered public version up-to-date? It feels a little bare-bones (e.g., introduction is just some keywords and CLI arguments).
- We should settle on a uniform way to run MPI jobs (`mpirun` vs `srun`, `-n` vs `-np` vs "let Slurm decide"). Not so much an issue within this episode, but between them.
- The output of `--report-bindings` is probably not beginner-friendly enough that you can just expect people to understand it. There should be an example output and some text going through each bit of information within.
- "CPU core, socket, or NUMA region." - Is there an earlier explanation of what core, socket, and NUMA mean? Maybe some lstopo output with an explanation would help here.
- It's a bit jarring that the course goes from requesting resources via Slurm to just assuming mpirun can use whatever CPU cores it wants. I guess this assumes that you have allocated a node exclusively and are working on it interactively? If so, that is a bit confusing as for production jobs you would need to ask Slurm as well to bind resources adequately, right? In general, some of the explanations are very MPI focused and could perhaps benefit from some generalization towards Slurm resource management.
- "Exercise: Understanding Process and Thread Binding": I feel like the learning outcome is a bit unclear here. It feels like this is just a more complicated version of the first example of core oversubscription. If I understand it correctly, the NUMA and socket binding is only relevant insofar as the domains have different sizes. If that is correct, then I don't think it is clear to the learners why CPU sockets / NUMA domains are relevant at all for pinning (i.e., why not just pin any other combination of cores as long as each MPI rank can fill all its threads? In other words, why use pinning when `--cpus-per-task` is seemingly sufficient?).
- The differentiation between mapping and binding is not that clear at the beginning, I think. "tell[ing] where your processes will be placed" and "locking your MPI processes/threads to a specific resource" sound very similar to me at least. The example later clears this up, but I think the definition of mapping can be sharpened a bit more to avoid any initial confusion (one quick suggestion which might not be accurate: "Mapping is about selecting which resources are available to your job in total, while binding is about locking each MPI rank and/or thread inside your job to a specific resource.").
- I feel like the discussion on binding could use a bit more nuance. I'm not convinced that binding is necessary for every kind of job. In some earlier testing with perf I saw that even without pinning, there are very few CPU migrations, i.e., the Linux kernel seems to be smart enough to schedule the same thread to the same core. What is of course very important is to avoid overcommitting cores, but that is also achievable with `--cpus-per-task`. If we want to convince people that CPU pinning is actually necessary, I think we should show an example where `--cpus-per-task` is insufficient (i.e., gives worse performance than with pinning). I haven't tested this, but I would suspect that the raytracer does not care too much about pinning as there is very little communication between threads. Maybe one could make an example where multiple threads in different NUMA domains are hammering one cache line or something, so you could observe that this is slower than if all threads were running in the same NUMA domain.
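
To make the CPU-migration point above easier to check without perf, here is a minimal sketch, assuming a Linux system with `/proc`. It is not the measurement I did earlier and not taken from the episode. It samples the "processor" field (field 39 in proc(5)) of `/proc/self/stat` and counts how often the CPU changes between consecutive samples. The helper names `cpu_from_stat`, `count_migrations`, and `sample_self` are made up for illustration. Since it only sees the CPU at each sampling instant, it can undercount migrations compared to perf.

```python
import os
import time

# 1-based position of the "processor" field in /proc/<pid>/stat, per proc(5).
PROCESSOR_FIELD = 39


def cpu_from_stat(stat_line):
    """Return the CPU a task last ran on, parsed from a /proc/<pid>/stat line."""
    # Field 2 (comm) is parenthesised and may itself contain spaces or ')',
    # so split after the LAST ')' instead of splitting the whole line.
    rest = stat_line.rsplit(")", 1)[1].split()
    # 'rest' starts at field 3 (state), hence the offset of 3.
    return int(rest[PROCESSOR_FIELD - 3])


def count_migrations(cpus):
    """Count how often two consecutive samples show a different CPU."""
    return sum(1 for a, b in zip(cpus, cpus[1:]) if a != b)


def sample_self(n_samples=200, interval=0.005):
    """Sample the CPU of the current process while keeping it busy."""
    cpus = []
    for _ in range(n_samples):
        # Burn a few cycles so the process is actually scheduled between samples.
        sum(i * i for i in range(20000))
        with open("/proc/self/stat") as f:
            cpus.append(cpu_from_stat(f.read()))
        time.sleep(interval)
    return cpus


if __name__ == "__main__":
    if os.path.exists("/proc/self/stat"):
        if hasattr(os, "sched_getaffinity"):
            # The set of CPUs the kernel may schedule this process on.
            print("allowed CPUs:", sorted(os.sched_getaffinity(0)))
        cpus = sample_self()
        print("CPUs seen:", sorted(set(cpus)))
        print("migrations between samples:", count_migrations(cpus))
    else:
        print("no /proc/self/stat found; this sketch is Linux-only")
```

Running it once as-is and once restricted to a single core (e.g., under `taskset -c 0`) should show the difference between "allowed to run on several cores" and "bound to one core". Whether that difference matters for performance is exactly what a better example in the episode would need to demonstrate.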