fault-tolerance/README.md: 4 additions & 4 deletions
@@ -85,7 +85,7 @@ Here we have 47 nodes being used (`alloc`), 23 available (`idle`) and 4 unavailable (`drain`)
The sysadmin is expected to periodically check the drained nodes, fix or replace them and then make them again available to be used by changing their state to `idle`.
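For reference, a minimal sketch of what that periodic check could look like (the node name is a placeholder and the exact commands may vary by site):

```
# list drained/down nodes together with the reason they were marked as such
sinfo -R

# after fixing or replacing a node, return it to service;
# RESUME clears the drain state and the node becomes idle again once free
scontrol update NodeName=node-042 State=RESUME
```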
-The other approach is to daisy-chain jobs via `--dependency` as explained [here](../slurm/users.md#request-allocation-via-dependency). Both of these approaches could also be combined.
+The other approach is to daisy-chain jobs via `--dependency` as explained [here](../orchestration/slurm/users.md#request-allocation-via-dependency). Both of these approaches could also be combined.
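For illustration only (the job script name is made up and the linked section may suggest a slightly different recipe), daisy-chaining with `--dependency` can look like this:

```
# submit the first job and capture its job id
JOB_ID=$(sbatch --parsable train.slurm)

# queue a few more jobs, each allowed to start only after the previous one ends
for i in 1 2 3; do
    JOB_ID=$(sbatch --parsable --dependency=afterany:$JOB_ID train.slurm)
done
```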
How do you know when the job array or a daisy chain should not resume - well, normally the training loop will exit immediately if it knows the job is done. But you could also add features like [kill switch](#kill-switch) which are even easier to use to prevent a job array from running.
@@ -111,7 +111,7 @@ In many SLURM environments users have no `sudo` access and when one user started
This was the situation during BLOOM-176B training and we implemented a kill-switch to handle it. The mechanism is very simple. The training loop polls for a specific file to appear before starting a new iteration and, if the file is there, the program saves the checkpoint and exits, allowing users other than the one who started the previous training to change things and restart it again. An additional poll was added at the very beginning of `main`, so that if there was a long job array queued by the user who is asleep, it could be "burned through" quickly by having each job exit immediately on start.
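From the users' point of view the protocol is just a shared file path (the path below is only an example; the real one is whatever the team agreed on and passed to the training program):

```
# any team member can request a clean shutdown: before each iteration (and once
# at startup) the training loop checks for this file, saves a checkpoint and exits
touch /shared/project/kill-switch

# remove the file before relaunching, otherwise the new jobs will also exit on start
rm -f /shared/project/kill-switch
```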
-This is also discussed [here](../slurm/users.md#overcoming-the-lack-of-group-slurm-job-ownership).
+This is also discussed [here](../orchestration/slurm/users.md#overcoming-the-lack-of-group-slurm-job-ownership).
This facility helps to minimize the amount of wasted training time.
@@ -143,7 +143,7 @@ To set up a crontab, execute `crontab -e` and check which jobs are scheduled with `crontab -l`
The reason I don't go into many details is that many SLURM environments don't provide access to the `crontab` facility, and therefore one needs to use other approaches to scheduling jobs.
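For environments that do have `crontab`, a single hourly entry is enough to drive a watchdog (the script path here is hypothetical):

```
# added via crontab -e: run the watchdog at minute 0 of every hour
0 * * * * /path/to/watchdog.sh >> /path/to/watchdog.log 2>&1
```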
-The section on [Crontab Emulation](../slurm/users.md#crontab-emulation) discusses how to implement crontab-like SLURM emulation and also [Self-perpetuating SLURM jobs](../slurm/users.md#self-perpetuating-slurm-jobs).
+The section on [Crontab Emulation](../orchestration/slurm/users.md#crontab-emulation) discusses how to implement crontab-like SLURM emulation and also [Self-perpetuating SLURM jobs](../orchestration/slurm/users.md#self-perpetuating-slurm-jobs).
### Notification facility
@@ -160,7 +160,7 @@ Once you understand how to schedule watchdogs and you have a notification facility
The most obvious watchdog is one which checks that there is a training SLURM job running or more are scheduled to run.
-Here is an example [slurm-status.py](./slurm-status.py) that was used during BLOOM-176B training. This watchdog was sending an email if a job was detected to be neither running nor scheduled and it was also piping its check results into the main training's log file. As we used [Crontab Emulation](../slurm/users.md#crontab-emulation), we simply needed to drop [slurm-status.slurm](./slurm-status.slurm) into the `cron/cron.hourly/` folder and the previously launched SLURM crontab emulating scheduler would launch this check approximately once an hour.
+Here is an example [slurm-status.py](orchestration/slurm-status.py) that was used during BLOOM-176B training. This watchdog was sending an email if a job was detected to be neither running nor scheduled and it was also piping its check results into the main training's log file. As we used [Crontab Emulation](../orchestration/slurm/users.md#crontab-emulation), we simply needed to drop [slurm-status.slurm](orchestration/slurm-status.slurm) into the `cron/cron.hourly/` folder and the previously launched SLURM crontab emulating scheduler would launch this check approximately once an hour.
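The core of such a check boils down to a couple of shell commands (the job name and email address are placeholders, and the actual slurm-status.py does more, e.g. it also logs into the training log file):

```
# is a job with this name in the queue, either running or pending?
if [[ -z $(squeue --user=$USER --name=my-training --noheader) ]]; then
    echo "no training job is running or scheduled" \
        | mail -s "training watchdog alert" me@example.com
fi
```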
insights/ai-battlefield.md: 1 addition & 1 deletion
@@ -140,7 +140,7 @@ Cons:
There aren't that many HPCs out there and so the amount of available resources is limited.
Pros:
-- Managed for you - all you need is your software to do the training and a bit of [SLURM](../slurm) know-how to launch jobs
+- Managed for you - all you need is your software to do the training and a bit of [SLURM](../orchestration/slurm) know-how to launch jobs
- Often sponsored by the local government/university - probably could get the job done for less $$ or even free (e.g. we trained [BLOOM-176B](https://huggingface.co/bigscience/bloom) for free on [JeanZay HPC](http://www.idris.fr/eng/jean-zay/)!)