# How to parallelize independent tasks using workflows (example: [Snakemake](https://snakemake.github.io/))

:::{objectives}
- Understand the concept of a workflow management tool.
- Instead of thinking in terms of individual step-by-step commands, think in
  terms of **dependencies** (**rules**).
- Try to port our computational pipeline to Snakemake.
- See how Snakemake can identify independent steps and run them in parallel.
- It is not our goal to remember all the details of Snakemake.
:::


## The problem

Imagine we want to process a large number of similar input files.
| 16 | + |
| 17 | +This could be one way to do it: |
| 18 | +```bash |
| 19 | +#!/usr/bin/env bash |
| 20 | + |
| 21 | +num_rounds=10 |
| 22 | + |
| 23 | +for i in $(seq -w 1 ${num_rounds}); do |
| 24 | + ./conda.sif python generate_data.py \ |
| 25 | + --num-samples 2000 \ |
| 26 | + --training-data data/train_${i}.csv \ |
| 27 | + --test-data data/test_${i}.csv |
| 28 | + |
| 29 | + ./conda.sif python generate_predictions.py \ |
| 30 | + --num-neighbors 7 \ |
| 31 | + --training-data data/train_${i}.csv \ |
| 32 | + --test-data data/test_${i}.csv |
| 33 | + --predictions results/predictions_${i}.csv |
| 34 | + |
| 35 | + ./conda.sif python plot_results.py \ |
| 36 | + --training-data data/train_${i}.csv \ |
| 37 | + --predictions results/predictions_${i}.csv \ |
| 38 | + --output-chart results/chart_${i}.png |
| 39 | +done |
| 40 | +``` |

Discuss possible problems with this approach.
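
One tempting fix is to send each loop iteration to the background so that the
rounds run concurrently. This sketch (same commands as above) illustrates why
hand-rolled parallelism is fragile:
```bash
#!/usr/bin/env bash

# naive parallelization: run each round as a background job
num_rounds=10

for i in $(seq -w 1 ${num_rounds}); do
    (
        ./conda.sif python generate_data.py \
            --num-samples 2000 \
            --training-data data/train_${i}.csv \
            --test-data data/test_${i}.csv

        ./conda.sif python generate_predictions.py \
            --num-neighbors 7 \
            --training-data data/train_${i}.csv \
            --test-data data/test_${i}.csv \
            --predictions results/predictions_${i}.csv

        ./conda.sif python plot_results.py \
            --training-data data/train_${i}.csv \
            --predictions results/predictions_${i}.csv \
            --output-chart results/chart_${i}.png
    ) &
done
wait  # block until all background jobs have finished
```
This starts all rounds at once no matter how many cores are available, offers
no control over failures, and recomputes everything from scratch on every run.
These are exactly the problems a workflow management tool solves for us.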


## Thinking in terms of dependencies

For the following we will assume that we have the input data available:
```bash
#!/usr/bin/env bash

num_rounds=10

for i in $(seq -w 1 ${num_rounds}); do
    ./conda.sif python generate_data.py \
        --num-samples 2000 \
        --training-data data/train_${i}.csv \
        --test-data data/test_${i}.csv
done
```

From here on we will **focus on the processing part**.

The central file in Snakemake is the `snakefile`. This is how we can express
the pipeline in Snakemake (we will explain it below):
```
# the comma is there because glob_wildcards returns a named tuple
numbers, = glob_wildcards("data/train_{number}.csv")


# rule that collects the target files
rule all:
    input:
        expand("results/chart_{number}.svg", number=numbers)


rule chart:
    input:
        script="plot_results.py",
        predictions="results/predictions_{number}.csv",
        training="data/train_{number}.csv"
    output:
        "results/chart_{number}.svg"
    log:
        "logs/chart_{number}.txt"
    shell:
        """
        python {input.script} --training-data {input.training} --predictions {input.predictions} --output-chart {output}
        """


rule predictions:
    input:
        script="generate_predictions.py",
        training="data/train_{number}.csv",
        test="data/test_{number}.csv"
    output:
        "results/predictions_{number}.csv"
    log:
        "logs/predictions_{number}.txt"
    shell:
        """
        python {input.script} --num-neighbors 7 --training-data {input.training} --test-data {input.test} --predictions {output}
        """
```

**Explanation**:
- The `snakefile` contains 3 **rules** and Snakemake will run the "all" rule by default unless
  we ask it to produce a different "target":
  - "all"
  - "chart"
  - "predictions"
- Rules "predictions" and "chart" depend on **input** and produce **output**.
- Note how "all" depends on the output of the "chart" rule and how "chart" depends
  on the output of the "predictions" rule.
- The **shell** part of a rule shows how to produce the output from the
  input.
- We ask Snakemake to collect all files that match `"data/train_{number}.csv"`
  and from this to infer the **wildcard** `{number}` (see the plain-Python
  sketch after this list).
- Later we can refer to `{number}` throughout the `snakefile`.
- This part defines what we want to have at the end:
  ```
  rule all:
      input:
          expand("results/chart_{number}.svg", number=numbers)
  ```
- Rules correspond to steps, and parameter scanning can be done with wildcards.
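
For intuition, here is a minimal plain-Python sketch of roughly what the
`glob_wildcards` and `expand` lines do (a simplification, not Snakemake's
actual implementation):
```python
import glob
import re

# glob_wildcards("data/train_{number}.csv"): find matching files
# and extract the part that the {number} wildcard stands for
numbers = []
for path in glob.glob("data/train_*.csv"):
    match = re.fullmatch(r"data/train_(.+)\.csv", path)
    if match:
        numbers.append(match.group(1))

# expand("results/chart_{number}.svg", number=numbers): build the
# list of target files that the "all" rule asks for
targets = [f"results/chart_{number}.svg" for number in numbers]

print(numbers)  # e.g. ['01', '02', ..., '10']
print(targets)  # e.g. ['results/chart_01.svg', ...]
```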


## Exercise

::::{exercise} Exercise: Practicing with Snakemake
1. Create a `snakefile` (above) and run it with Snakemake (adjust the number of
   cores):
   ```console
   $ snakemake --cores 8
   ```

1. Check the output. Did it use all available cores? How did it know
   which steps it can start in parallel?

1. Run Snakemake again. Now it should finish almost immediately because all the
   results are already there. Aha! Snakemake does not repeat steps that are
   already done. You can delete all output with `snakemake --delete-all-output`
   and the next run will then start from scratch. (A dry run, shown after this
   exercise, previews what Snakemake would do without running anything.)

1. Remove a few files from `results/` and run it again.
   Snakemake should now only re-run the steps that are necessary to get the
   deleted files.

1. Modify `generate_predictions.py`, which is used in the rule "predictions".
   Now Snakemake will only re-run the steps that depend on this script.

1. It is possible to only process one file (useful for testing):
   ```console
   $ snakemake --cores 1 results/predictions_09.csv
   ```

1. Add a few more data files to the input directory `data/` (for instance by
   copying some existing ones) and observe how Snakemake will pick them up next
   time you run the workflow, without changing the `snakefile`.
::::
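
Tip: you can always ask Snakemake to show what it *would* do, without actually
running anything:
```console
$ snakemake --dry-run
```
This prints the planned jobs and is a quick way to check that the wildcards and
dependencies resolve the way you expect.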


## What else is possible?

- With the option `--keep-going` we can tell Snakemake not to give up on the first failure.

- The option `--restart-times 3` would tell Snakemake to try to restart a rule
  up to 3 times if it fails.

- It is possible to tell Snakemake to use a **locally mounted file system** instead
  of the default network file system for certain rules
  ([documentation](https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/fs.html)).

- Sometimes you need to run different rules inside different software
  environments (e.g. **Conda environments**) and this is possible (see the
  sketch after this list).

- A lot **more is possible**:
  - [Snakemake documentation](https://snakemake.readthedocs.io)
  - [Snakemake tutorial](https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html)
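
As an illustration of per-rule Conda environments, here is a minimal sketch of
how the "predictions" rule could be adapted (the environment file
`envs/analysis.yml` is a hypothetical name; the workflow would then be run with
`snakemake --cores 8 --use-conda`):
```
rule predictions:
    input:
        script="generate_predictions.py",
        training="data/train_{number}.csv",
        test="data/test_{number}.csv"
    output:
        "results/predictions_{number}.csv"
    # hypothetical environment file; with --use-conda, Snakemake creates
    # and activates this Conda environment just for this rule
    conda:
        "envs/analysis.yml"
    shell:
        """
        python {input.script} --num-neighbors 7 --training-data {input.training} --test-data {input.test} --predictions {output}
        """
```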


## More elaborate example

In the example below we also scan over a range of values for the `--num-neighbors` parameter:
:::{code-block}
:emphasize-lines: 7,40-41


# the comma is there because glob_wildcards returns a named tuple
numbers, = glob_wildcards("data/train_{number}.csv")


# define the parameter scan for num-neighbors
neighbor_values = [1, 3, 5, 7, 9, 11]


# rule that collects all target files
rule all:
    input:
        expand("results/chart_{number}_{num_neighbors}.svg", number=numbers, num_neighbors=neighbor_values)


rule chart:
    input:
        script="plot_results.py",
        predictions="results/predictions_{number}_{num_neighbors}.csv",
        training="data/train_{number}.csv"
    output:
        "results/chart_{number}_{num_neighbors}.svg"
    log:
        "logs/chart_{number}_{num_neighbors}.txt"
    shell:
        """
        python {input.script} --training-data {input.training} --predictions {input.predictions} --output-chart {output}
        """


rule predictions:
    input:
        script="generate_predictions.py",
        training="data/train_{number}.csv",
        test="data/test_{number}.csv"
    output:
        "results/predictions_{number}_{num_neighbors}.csv"
    log:
        "logs/predictions_{number}_{num_neighbors}.txt"
    params:
        num_neighbors=lambda wildcards: wildcards.num_neighbors
    shell:
        """
        python {input.script} --num-neighbors {params.num_neighbors} --training-data {input.training} --test-data {input.test} --predictions {output}
        """
:::
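
With 10 rounds of input data and 6 values for `--num-neighbors`, the "all" rule
now collects 10 × 6 = 60 charts, and Snakemake distributes all independent
prediction and chart jobs across the cores we give it:
```console
$ snakemake --cores 8
```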


## Other workflow management tools

- Another very popular one is [Nextflow](https://www.nextflow.io/).

- Many workflow management tools exist
  ([overview](https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems)).