fault-tolerance/README.md: 4 additions & 4 deletions
@@ -85,7 +85,7 @@ Here we have 47 nodes being used (`alloc`), 23 available (`idle`) and 4 unavailable (`drain`)
The sysadmin is expected to periodically check the drained nodes, fix or replace them and then make them again available to be used by changing their state to `idle`.
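For reference, a minimal sketch of what that periodic check could look like (the node name is a placeholder and the exact commands may vary by site):

```
# list drained/down nodes together with the reason they were marked as such
sinfo -R

# after fixing or replacing a node, return it to service;
# RESUME clears the drain state and the node becomes idle again once free
scontrol update NodeName=node-042 State=RESUME
```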
-The other approach is to daisy-chain jobs via `--dependency` as explained [here](../slurm/users.md#request-allocation-via-dependency). Both of these approaches could also be combined.
+The other approach is to daisy-chain jobs via `--dependency` as explained [here](../orchestration/slurm/users.md#request-allocation-via-dependency). Both of these approaches could also be combined.
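For illustration only (the job script name is made up and the linked section may suggest a slightly different recipe), daisy-chaining with `--dependency` can look like this:

```
# submit the first job and capture its job id
JOB_ID=$(sbatch --parsable train.slurm)

# queue a few more jobs, each allowed to start only after the previous one ends
for i in 1 2 3; do
    JOB_ID=$(sbatch --parsable --dependency=afterany:$JOB_ID train.slurm)
done
```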
How do you know when the job array or a daisy chain should not resume - well, normally the training loop will exit immediately if it knows the job is done. But you could also add features like [kill switch](#kill-switch) which are even easier to use to prevent a job array from running.
@@ -111,7 +111,7 @@ In many SLURM environments users have no `sudo` access and when one user started
This was the situation during BLOOM-176B training and we implemented a kill-switch to handle it. The mechanism is very simple. The training loop polls for a specific file to appear before starting a new iteration and, if the file is there, the program saves the checkpoint and exits, allowing users other than the one who started the previous training to change things and restart it again. An additional poll was added at the very beginning of `main`, so that if there was a long job array queued by the user who is asleep, it could be "burned through" quickly by having each job exit immediately on start.
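From the users' point of view the protocol is just a shared file path (the path below is only an example; the real one is whatever the team agreed on and passed to the training program):

```
# any team member can request a clean shutdown: before each iteration (and once
# at startup) the training loop checks for this file, saves a checkpoint and exits
touch /shared/project/kill-switch

# remove the file before relaunching, otherwise the new jobs will also exit on start
rm -f /shared/project/kill-switch
```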
-This is also discussed [here](../slurm/users.md#overcoming-the-lack-of-group-slurm-job-ownership).
+This is also discussed [here](../orchestration/slurm/users.md#overcoming-the-lack-of-group-slurm-job-ownership).
This facility helps to minimize the amount of wasted training time.
@@ -143,7 +143,7 @@ To set up a crontab, execute `crontab -e` and check which jobs are scheduled with `crontab -l`
The reason I don't go into many details is that many SLURM environments don't provide access to the `crontab` facility, and therefore one needs to use other approaches to scheduling jobs.
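For environments that do have `crontab`, a single hourly entry is enough to drive a watchdog (the script path here is hypothetical):

```
# added via crontab -e: run the watchdog at minute 0 of every hour
0 * * * * /path/to/watchdog.sh >> /path/to/watchdog.log 2>&1
```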
-The section on [Crontab Emulation](../slurm/users.md#crontab-emulation) discusses how to implement crontab-like SLURM emulation and also [Self-perpetuating SLURM jobs](../slurm/users.md#self-perpetuating-slurm-jobs).
+The section on [Crontab Emulation](../orchestration/slurm/users.md#crontab-emulation) discusses how to implement crontab-like SLURM emulation and also [Self-perpetuating SLURM jobs](../orchestration/slurm/users.md#self-perpetuating-slurm-jobs).
### Notification facility
@@ -160,7 +160,7 @@ Once you understand how to schedule watchdogs and you have a notification facility
The most obvious watchdog is one which checks that there is a training SLURM job running or more are scheduled to run.
-Here is an example [slurm-status.py](./slurm-status.py) that was used during BLOOM-176B training. This watchdog was sending an email if a job was detected to be neither running nor scheduled and it was also piping its check results into the main training's log file. As we used [Crontab Emulation](../slurm/users.md#crontab-emulation), we simply needed to drop [slurm-status.slurm](./slurm-status.slurm) into the `cron/cron.hourly/` folder and the previously launched SLURM crontab emulating scheduler would launch this check approximately once an hour.
+Here is an example [slurm-status.py](orchestration/slurm-status.py) that was used during BLOOM-176B training. This watchdog was sending an email if a job was detected to be neither running nor scheduled and it was also piping its check results into the main training's log file. As we used [Crontab Emulation](../orchestration/slurm/users.md#crontab-emulation), we simply needed to drop [slurm-status.slurm](orchestration/slurm-status.slurm) into the `cron/cron.hourly/` folder and the previously launched SLURM crontab emulating scheduler would launch this check approximately once an hour.
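The core of such a check boils down to a couple of shell commands (the job name and email address are placeholders, and the actual slurm-status.py does more, e.g. it also logs into the training log file):

```
# is a job with this name in the queue, either running or pending?
if [[ -z $(squeue --user=$USER --name=my-training --noheader) ]]; then
    echo "no training job is running or scheduled" \
        | mail -s "training watchdog alert" me@example.com
fi
```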
insights/ai-battlefield.md: 1 addition & 1 deletion
@@ -140,7 +140,7 @@ Cons:
There aren't that many HPCs out there and so the amount of available resources is limited.
Pros:
-- Managed for you - all you need is your software to do the training and a bit of [SLURM](../slurm) know-how to launch jobs
+- Managed for you - all you need is your software to do the training and a bit of [SLURM](../orchestration/slurm) know-how to launch jobs
- Often sponsored by the local government/university - probably could get the job done for less $$ or even free (e.g. we trained [BLOOM-176B](https://huggingface.co/bigscience/bloom) for free on [JeanZay HPC](http://www.idris.fr/eng/jean-zay/)!)