
Commit 35d1764 ("finish")
1 parent 26333a9

File tree

12 files changed: +11 -11 lines changed

README.md

Lines changed: 2 additions & 2 deletions
@@ -43,7 +43,7 @@ My apologies if the layout is a bit unstable while I'm writing new chapters and

 **Part 4. Operating**

-1. **[SLURM](./slurm/)**
+1. **[SLURM](orchestration/slurm)**

 1. **[Training hyper-parameters and model initializations](./hparams/)**

@@ -82,7 +82,7 @@ Tools:
 Guides:

 - [debugging pytorch applications](./debug/pytorch.md) - quick copy-n-paste solutions to resolve hanging or breaking pytorch applications
-- [slurm for users](./slurm/users.md) - a slurm cheatsheet and tricks
+- [slurm for users](orchestration/slurm/users.md) - a slurm cheatsheet and tricks
 - [make tiny models/datasets/tokenizers](./transformers/make-tiny-models.md)
 - [LLM/VLM chronicles collection](https://github.com/stas00/ml-engineering/tree/master/resources#publicly-available-training-llmvlm-logbooks)

chapters-md.txt

Lines changed: 4 additions & 4 deletions
@@ -24,10 +24,10 @@

 ./model-parallelism/README.md

-./slurm/README.md
-./slurm/admin.md
-./slurm/users.md
-./slurm/performance.md
+./orchestration/slurm/README.md
+./orchestration/slurm/admin.md
+./orchestration/slurm/users.md
+./orchestration/slurm/performance.md

 ./hparams/README.md

fault-tolerance/README.md

Lines changed: 4 additions & 4 deletions
@@ -85,7 +85,7 @@ Here we have 47 nodes being used (`alloc`), 23 available (`idle`) and 4 unavaila

 The sysadmin is expected to periodically check the drained nodes, fix or replace them and then make them again available to be used by changing their state to `idle`.

-The other approach is to daisy-chain jobs via `--dependency` as explained [here](../slurm/users.md#request-allocation-via-dependency). Both of these approaches could also be combined.
+The other approach is to daisy-chain jobs via `--dependency` as explained [here](../orchestration/slurm/users.md#request-allocation-via-dependency). Both of these approaches could also be combined.

 How do you know when the job array or a daisy chain should not resume - well, normally the training loop will exit immediately if it knows the job is done. But you could also add features like [kill switch](#kill-switch) which are even easier to use to prevent a job array from running.

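The daisy-chaining mentioned in the hunk above works by making each new `sbatch` submission depend on the previous job. Here is a minimal sketch of how such a chain could be assembled; the `build_chain` helper, the `train.slurm` script name, and the choice of the `afterany` dependency type are illustrative assumptions, not part of this commit:

```python
# Sketch: assemble the argv for the next submission in a daisy-chain,
# where each job starts only after the previous one terminates
# (afterany fires on any exit status, success or failure).
def build_chain(prev_job_ids, script="train.slurm"):
    cmd = ["sbatch", "--parsable"]  # --parsable prints just the job id
    if prev_job_ids:
        # depend only on the most recently submitted job in the chain
        cmd.append(f"--dependency=afterany:{prev_job_ids[-1]}")
    cmd.append(script)
    return cmd

# First submission has no dependency; later ones chain on the last job id.
first = build_chain([])          # ['sbatch', '--parsable', 'train.slurm']
second = build_chain(["12345"])  # adds --dependency=afterany:12345
```

In practice each job id would be captured from the `sbatch --parsable` output and fed into the next call.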
@@ -111,7 +111,7 @@ In many SLURM environments users have no `sudo` access and when one user started

 This was the situation during BLOOM-176B training and we implemented a kill-switch to handle that. The mechanism is very simple. The training loop polls for a specific file to appear before starting a new iteration and if the file is there the program saves the checkpoint and exits, allowing users other than the one who started the previous training to change things and restart it again. An additional poll was added at the very beginning of `main` so that if there was a long job array queued by the user who is asleep they could be "burned through" quickly by getting each job exit quickly on start.

-This is also discussed [here](../slurm/users.md#overcoming-the-lack-of-group-slurm-job-ownership).
+This is also discussed [here](../orchestration/slurm/users.md#overcoming-the-lack-of-group-slurm-job-ownership).

 This facility helps to minimize the amount of wasted training time.

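The kill-switch described in the hunk above is just a file-existence poll at the top of `main` and before each training iteration. A minimal sketch, assuming a team-agreed file path and a caller-supplied checkpoint function (both hypothetical):

```python
import os
import sys

KILL_SWITCH = "/shared/path/kill-switch"  # hypothetical team-agreed location

def maybe_exit(path=KILL_SWITCH, save_checkpoint=None):
    """If the kill-switch file exists, checkpoint (if given) and exit,
    so a different user can modify things and restart the training."""
    if os.path.exists(path):
        if save_checkpoint is not None:
            save_checkpoint()
        sys.exit(0)

# Sketch of use: call maybe_exit(save_checkpoint=trainer.save) before each
# iteration, and plain maybe_exit() at the very top of main() so a queued
# job array "burns through" quickly once the switch is engaged.
```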
@@ -143,7 +143,7 @@ To setup a crontab, execute `crontab -e` and check which jobs are scheduled `cro

 The reason I don't go into many details is because many SLURM environments don't provide access to the `crontab` facility. And therefore one needs to use other approaches to scheduling jobs.

-The section on [Crontab Emulation](../slurm/users.md#crontab-emulation) discusses how to implement crontab-like SLURM emulation and also [Self-perpetuating SLURM jobs](../slurm/users.md#self-perpetuating-slurm-jobs).
+The section on [Crontab Emulation](../orchestration/slurm/users.md#crontab-emulation) discusses how to implement crontab-like SLURM emulation and also [Self-perpetuating SLURM jobs](../orchestration/slurm/users.md#self-perpetuating-slurm-jobs).


 ### Notification facility
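Where `crontab` is available, a single entry is enough to run an hourly watchdog; the script and log paths below are illustrative placeholders:

```
# fields: minute hour day-of-month month day-of-week command
# run a watchdog at minute 0 of every hour, appending its output to a log
0 * * * * /path/to/watchdog.sh >> /path/to/watchdog.log 2>&1
```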
@@ -160,7 +160,7 @@ Once you understand how to schedule watchdogs and you have a notification facili

 The most obvious watchdog is one which checks that there is a training SLURM job running or more are scheduled to run.

-Here is an example [slurm-status.py](./slurm-status.py) that was used during BLOOM-176B training. This watchdog was sending an email if a job was detected to be neither running nor scheduled and it was also piping its check results into the main training's log file. As we used [Crontab Emulation](../slurm/users.md#crontab-emulation), we simply needed to drop [slurm-status.slurm](./slurm-status.slurm) into the `cron/cron.hourly/` folder and the previously launched SLURM crontab emulating scheduler would launch this check approximately once an hour.
+Here is an example [slurm-status.py](orchestration/slurm-status.py) that was used during BLOOM-176B training. This watchdog was sending an email if a job was detected to be neither running nor scheduled and it was also piping its check results into the main training's log file. As we used [Crontab Emulation](../orchestration/slurm/users.md#crontab-emulation), we simply needed to drop [slurm-status.slurm](orchestration/slurm-status.slurm) into the `cron/cron.hourly/` folder and the previously launched SLURM crontab emulating scheduler would launch this check approximately once an hour.

 The key part of the SLURM job is:
 ```
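The watchdog just described boils down to checking `squeue` output for a running or pending job and alerting otherwise. A minimal sketch of that check follows; the real `slurm-status.py` in the repository does more, and the default `squeue` column layout assumed here is an assumption of this sketch:

```python
# Sketch: given default-format `squeue` output, report whether a job with
# the given name is alive, i.e. running (R) or scheduled/pending (PD).
def job_alive(squeue_output: str, job_name: str) -> bool:
    for line in squeue_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # default columns: JOBID PARTITION NAME USER ST TIME NODES NODELIST
        if len(fields) >= 5 and fields[2] == job_name and fields[4] in ("R", "PD"):
            return True
    return False

# A cron-launched wrapper would call this on `squeue -u $USER` output and
# send an email / append to the training log when it returns False.
```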

insights/ai-battlefield.md

Lines changed: 1 addition & 1 deletion
@@ -140,7 +140,7 @@ Cons:
 There aren't that many HPCs out there and so the amount of available resources is limited.

 Pros:
-- Managed for you - all you need is your software to do the training and a bit of [SLURM](../slurm) know-how to launch jobs
+- Managed for you - all you need is your software to do the training and a bit of [SLURM](../orchestration/slurm) know-how to launch jobs
 - Often sponsored by the local government/university - probably could get the job done for less $$ or even free (e.g. we trained [BLOOM-176B](https://huggingface.co/bigscience/bloom) for free on [JeanZay HPC](http://www.idris.fr/eng/jean-zay/)!)

 Cons:
7 files renamed without changes.
