# SLURM: HPC scheduler
If you have written some scripts and want to execute them, it is advisable to send them to the scheduler.
The scheduler (SLURM) distributes the jobs across the cluster (6+ machines) and makes sure that there are no CPU or memory conflicts when multiple people submit jobs at the same time.
This is the essential job of a scheduler.
## First steps
Sending jobs to SLURM in R is supported via the R package [clustermq](https://github.com/mschubert/clustermq).
```{block, type='rmdcaution'}
The R interpreters and packages are not shared with RSW.
Therefore, all R packages your script needs must be reinstalled on the HPC for the respective R version.
```
Rather than calling an R script directly, you need to wrap your code into a function and invoke it with `clustermq::Q()`.
Instead of using `clustermq` directly, you can make use of R packages like [{targets}](https://cran.r-project.org/web/packages/targets/index.html) or [{drake}](https://cran.r-project.org/web/packages/drake/index.html), which wrap your whole analysis so that all of its layers are executed on the HPC.
There is no other way to submit your R jobs to the compute nodes of the cluster than by using one of the tools mentioned above.
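As a minimal sketch of what such a call looks like (the function and the values are purely illustrative):

```{r eval = FALSE}
# Illustrative only: wrap the per-element work in a function and let
# clustermq::Q() run it as SLURM jobs (here split across 2 jobs).
square_one <- function(x) x^2
clustermq::Q(square_one, x = 1:100, n_jobs = 2)
```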
Also, it is essential to load all required system libraries (e.g. GDAL, PROJ) via environment modules so that they are available on all nodes.
```{block, type='rmdcaution'}
Note that most likely the versions of these libraries will differ from the ones used in the RSW container.
For reproducibility it is worth not deviating too much, or even using the same versions on the HPC and within RSW.
```
## SLURM commands
While the execution of jobs is explained in more detail in [Chapter 4](#submit-jobs), the following section aims to familiarize you with the basic usage of the scheduler.
The scheduler is queried via the terminal, i.e. you need to `ssh` into the server or switch to the "Terminal" tab in RStudio.
The most important SLURM commands are
- `sinfo`: An overview of the current state of the nodes
```sh
sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE  NODELIST
all*             up   infinite      4  alloc  c[0-2],edi
all*             up   infinite      2  idle   c[3-4]
frontend         up   infinite      1  alloc  edi
threadripper     up   infinite      4  alloc  c[0-2],edi
opteron          up   infinite      2  idle   c[3-4]
```
- `squeue`: An overview of the current jobs that are queued, including information about running jobs
```sh
squeue
    JOBID     PARTITION     NAME     USER  ST     TIME  NODES  NODELIST(REASON)
129_[2-5]  threadripper  cmq7381  patrick  PD     0:00      1  (Resources)
    121_2  threadripper  cmq7094  patrick   R  6:24:17      1  c1
    121_3  threadripper  cmq7094  patrick   R  6:24:17      1  c2
    129_1  threadripper  cmq7381  patrick   R  5:40:44      1  c0
```
- `sacct`: Overview of jobs that were submitted in the past including their end state
```sh
sacct
JobID  JobName  Partition     Account  AllocCPUS  State      ExitCode
122    cmq7094  threadripper  (null)   0          COMPLETED  0:0
123    cmq7094  threadripper  (null)   0          PENDING    0:0
121    cmq7094  threadripper  (null)   0          PENDING    0:0
125    cmq6623  threadripper  (null)   0          FAILED     1:0
126    cmq6623  threadripper  (null)   0          FAILED     1:0
127    cmq6623  threadripper  (null)   0          FAILED     1:0
128    cmq6623  threadripper  (null)   0          FAILED     1:0
124    cmq6623  threadripper  (null)   0          FAILED     1:0
130    cmq7381  threadripper  (null)   0          PENDING    0:0
```
- `scancel`: Cancel running or pending jobs by their job ID
If you want to cancel all jobs of your user, you can call `scancel -u <username>`.
## Submitting jobs {#submit-jobs}
### `clustermq` setup
Every job submission is done via `clustermq::Q()` (either directly or via `drake`).
See the [clustermq](https://mschubert.github.io/clustermq/) documentation for instructions on how to set up the package.
First, you need to set some options in your `.Rprofile` (on the master node or in your project root when you use {renv} or {packrat}):
```r
options(
clustermq.scheduler = "slurm",
clustermq.template = "/path/to/file"
)
```
See the [package vignette](https://mschubert.github.io/clustermq/articles/userguide.html#slurm) on how to set up the file.
Note that you can have multiple `.Rprofile` files on your system:
1. Your default R interpreter will use the `.Rprofile` found in the home directory (`~/`).
1. But you can also save an `.Rprofile` file in the root directory of a (RStudio) project (which will be preferred over the one in $HOME).
This way you can use customized `.Rprofile` files tailored to a project.
At this stage you should be able to run the [example](https://github.com/mschubert/clustermq) at the top of the `README` of the {clustermq} package.
It is a very simple example which finishes in a few seconds.
If it does not work, you either did something wrong or the nodes are busy.
Check with `sinfo` and `squeue`.
Otherwise see the [troubleshooting](#troubleshooting) chapter.
```{block, type='rmdcaution'}
Be sure to set `n_cpus` in the `template` argument of `clustermq::Q()` if your submitted job is parallelized!
If you submit a parallelized job without telling the scheduler, the scheduler will reserve 1 core for this job (because it thinks it is sequential) while in fact multiple processes will spawn.
This can affect all running processes on the node, since the scheduler accepts more work than the node can actually handle.
```
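For example, if your worker function parallelizes internally, a matching call could look like the following sketch (the function, data, and resource values are purely illustrative):

```{r eval = FALSE}
# Illustrative sketch: the worker uses 4 cores internally via mclapply(),
# so the scheduler is told to reserve 4 CPUs per job through the template.
fit_one <- function(i) {
  parallel::mclapply(1:100, function(j) sqrt(i * j), mc.cores = 4)
}
clustermq::Q(fit_one, i = 1:10, n_jobs = 2,
             template = list(n_cpus = 4, memory = 4096))
```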
### The scheduler template
To successfully submit jobs to the scheduler, you need to set the `.Rprofile` options given above.
Note that you can add any bash commands into the scripts between the `SBATCH` section and the final R call.
For example, a template could look as follows:
```sh
#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --partition=all
#SBATCH --output={{ log_file | /dev/null }} # you can add .%a for array index
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --cpus-per-task={{ n_cpus }}
#SBATCH --mem={{ memory }}
#SBATCH --array=1-{{ n_jobs }}
source ~/.bashrc
cd /full/path/to/project
# load desired R version via an env module
module load r-3.5.2-gcc-9.2.0-4syrmqv
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
```
Note: The `#` signs are not mistakes here; they do not act as comment characters in this context.
The `#SBATCH` lines are directives that are read and applied by the scheduler.
You can simply copy it and adjust it to your needs.
You only need to set the right path to your project and specify the R version you want to use.
### Allocating resources
There are two approaches/packages you can use:
- `drake` / `targets`
- `clustermq`
The `drake` approach is only valid if you have set up your project as a `drake` or `targets` project.
```{r eval = FALSE}
drake::make(plan, parallelism = "clustermq", jobs = 1,
            template = list(n_cpus = <X>, log_file = <Y>, memory = <Z>))
```
```{r eval = FALSE}
clustermq::Q(template = list(n_cpus = <X>, log_file = <Y>, memory = <Z>))
```
(The individual components of these calls are explained in more detail below.)
Note that `drake` uses `clustermq` under the hood.
Notation like `<X>` is meant to be read as a placeholder that needs to be replaced with valid content.
When submitting jobs via `clustermq::Q()`, it is important to tell the scheduler how many cores and how much memory should be reserved for you.
If you specify fewer cores than you actually use in your script (e.g. through internal parallelization), the scheduler will plan with X cores although your submitted code spawns Y processes in the background.
This might overload the node and eventually cause your script and, more importantly, the processes of others to crash.
There are two ways to specify these settings, depending on which approach you use:
1. via `clustermq::Q()` directly
Pass the values via the `template` argument, e.g. `template = list(n_cpus = <X>, memory = <Y>)`.
They are then substituted into the `clustermq.template` file (frequently named `slurm_clustermq.tmpl`), which contains the following lines:
```sh
#SBATCH --cpus-per-task={{ n_cpus }}
#SBATCH --mem={{ memory }}
```
This tells the scheduler how many resources (here CPUs and memory) your job needs.
2. via `drake::make()`
Again, set the options via the `template` argument, e.g. `template = list(n_cpus = <X>, memory = <Y>)` (see the sketch after this list).
See section ["The resources column for transient workers"](https://ropenscilabs.github.io/drake-manual/hpc.html#advanced-options) in the drake manual.
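A filled-in version could look like the following sketch (assuming an existing `drake` plan object named `plan`; the resource values are placeholders, not recommendations):

```{r eval = FALSE}
# Sketch: run the plan with clustermq workers on SLURM, reserving 4 CPUs and
# 8 GB of memory per worker (values are illustrative).
drake::make(
  plan,
  parallelism = "clustermq",
  jobs = 2,
  template = list(n_cpus = 4, log_file = "worker.log", memory = 8192)
)
```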
```{block, type='rmdcaution'}
Please think upfront about how many CPUs and how much memory your task requires.
The following two examples show you the implications of wrong specifications.
```
```{block, type='rmdcaution'}
`mclapply(mc.cores = 20)` (in your script) > `n_cpus = 16`
In this case, four workers will always be in "waiting mode" since your resource request only allows 16 CPUs to be used.
This slows down your parallelization but does no harm to other users.
```
```{block, type='rmdcaution'}
`mclapply(mc.cores = 11)` < `n_cpus = 16`
In this case, you reserve 16 CPUs on the machine but use at most 11.
This blocks five CPUs of the machine for no reason, potentially causing other people's jobs to be queued rather than processed immediately.
```
Furthermore, if you want to use all resources of a node and run into memory problems, try reducing the number of CPUs (if you already increased the memory to its maximum).
If you scale down the number of CPUs, you will have more memory per CPU available.
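As a sketch (the function and the values are illustrative and depend on your nodes), such a trade-off could look like this:

```{r eval = FALSE}
# Illustrative: a hypothetical memory-hungry worker function.
memory_hungry_fun <- function(x) {
  m <- matrix(rnorm(2e7), ncol = 1000)  # large temporary object
  colSums(m) * x
}
# Reserving 8 instead of 16 CPUs while keeping a large memory request
# leaves more memory per CPU for each job (values are illustrative).
clustermq::Q(memory_hungry_fun, x = 1:10, n_jobs = 2,
             template = list(n_cpus = 8, memory = 60000))
```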
### Monitoring progress
When submitting jobs, you can track their progress by specifying a `log_file` in the `clustermq::Q()` call, e.g. `clustermq::Q(..., template = list(log_file = "/path/to/file"))` (see the sketch at the end of this section).
For `drake`, the equivalent is the `console_log_file` argument of `make()` or `drake_config()`.
If your jobs are running on a node, you can SSH into the node, e.g. `ssh c0`.
There you can take a look at the current load by using `htop`.
Note that you can only log in if you have a running process on that specific node.
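As a sketch of the `log_file` option mentioned above (the function name and path are illustrative):

```{r eval = FALSE}
# Illustrative: write worker logs to a file so progress can be followed,
# e.g. with `tail -f`, while the jobs are running.
my_fun <- function(x) x + 1
clustermq::Q(my_fun, x = 1:10, n_jobs = 2,
             template = list(log_file = "/home/user/logs/worker.log"))
```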
### `renv` specifics
If {renv} is used and jobs should be sent from within RSW, Slurm tries to load {clustermq} and {renv} from the following library
```
<your/project>/renv/library/linux-centos-7/R-4.0/x86_64-pc-linux-gnu/
```
This library is not used by default, only in this very specific situation (Slurm + RSW).
The reason for this is that Slurm thinks it is on CentOS when invoking the `CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'` call and tries to find {clustermq} in this specific library.
When working directly on the HPC via a terminal, the {renv} library path is `renv/library/R-4.0/x86_64-pc-linux-gnu/`.
Simply copying {clustermq} and {renv} to this location is enough:
```
mkdir -p renv/library/linux-centos-7/R-4.0/x86_64-pc-linux-gnu
cp -R renv/library/R-4.0/x86_64-pc-linux-gnu/clustermq renv/library/linux-centos-7/R-4.0/x86_64-pc-linux-gnu/
cp -R renv/library/R-4.0/x86_64-pc-linux-gnu/renv renv/library/linux-centos-7/R-4.0/x86_64-pc-linux-gnu/
```
### RStudio Slurm Job Launcher Plugin
While using the Launcher GUI in RStudio would simplify some things, the problem is that it requires the R versions to be shared across all nodes.
Since the RSW container uses its own R versions and is decoupled from the R environment modules used on the HPC, adding these would duplicate the R versions in the container and create confusion.
Also, the RStudio GUI does not seem to allow loading additional env modules, which is required for certain R packages.
## Summary
1. Set up your `.Rprofile` with `options(clustermq.scheduler = "slurm", clustermq.template = "/path/to/file")`.
1. Decide which approach you want to use: `drake`/`targets` or `clustermq` directly.
1. Create the SLURM template file itself (in your `$HOME` or project directory) and make sure `clustermq.template` points to it.