
Commit 3bc7c17

snakemake episode
1 parent 5d89b49 commit 3bc7c17

2 files changed: +244 -1 lines changed

content/index.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -85,7 +85,7 @@ them to own projects**.
 
 - 10:15-11:30 - Modular code development
   - {doc}`refactoring-concepts`
-  - How to parallelize independent tasks using workflows
+  - {doc}`snakemake`
 
 - 11:30-12:15 - Lunch break
 
@@ -134,6 +134,7 @@ good-practices
 profiling
 testing
 refactoring-concepts
+snakemake
 software-licensing
 publishing
 packaging
```

content/snakemake.md

Lines changed: 242 additions & 0 deletions

# How to parallelize independent tasks using workflows (example: [Snakemake](https://snakemake.github.io/))

:::{objectives}
- Understand the concept of a workflow management tool.
- Instead of thinking in terms of individual step-by-step commands, think in
  terms of **dependencies** (**rules**).
- Try to port our computational pipeline to Snakemake.
- See how Snakemake can identify independent steps and run them in parallel.
- It is not our goal to remember all the details of Snakemake.
:::


## The problem

Imagine we want to process a large number of similar input datasets.

This could be one way to do it:
```bash
#!/usr/bin/env bash

num_rounds=10

for i in $(seq -w 1 ${num_rounds}); do
    ./conda.sif python generate_data.py \
        --num-samples 2000 \
        --training-data data/train_${i}.csv \
        --test-data data/test_${i}.csv

    ./conda.sif python generate_predictions.py \
        --num-neighbors 7 \
        --training-data data/train_${i}.csv \
        --test-data data/test_${i}.csv \
        --predictions results/predictions_${i}.csv

    ./conda.sif python plot_results.py \
        --training-data data/train_${i}.csv \
        --predictions results/predictions_${i}.csv \
        --output-chart results/chart_${i}.png
done
```

Discuss possible problems with this approach.


## Thinking in terms of dependencies

For the following we will assume that we have the input data available:
```bash
#!/usr/bin/env bash

num_rounds=10

for i in $(seq -w 1 ${num_rounds}); do
    ./conda.sif python generate_data.py \
        --num-samples 2000 \
        --training-data data/train_${i}.csv \
        --test-data data/test_${i}.csv
done
```

From here on we will **focus on the processing part**.

The central file in Snakemake is the `snakefile`. This is how we can express
the pipeline in Snakemake (below we will explain it):
```
# the comma is there because glob_wildcards returns a named tuple
numbers, = glob_wildcards("data/train_{number}.csv")


# rule that collects the target files
rule all:
    input:
        expand("results/chart_{number}.svg", number=numbers)


rule chart:
    input:
        script="plot_results.py",
        predictions="results/predictions_{number}.csv",
        training="data/train_{number}.csv"
    output:
        "results/chart_{number}.svg"
    log:
        "logs/chart_{number}.txt"
    shell:
        """
        python {input.script} --training-data {input.training} --predictions {input.predictions} --output-chart {output}
        """


rule predictions:
    input:
        script="generate_predictions.py",
        training="data/train_{number}.csv",
        test="data/test_{number}.csv"
    output:
        "results/predictions_{number}.csv"
    log:
        "logs/predictions_{number}.txt"
    shell:
        """
        python {input.script} --num-neighbors 7 --training-data {input.training} --test-data {input.test} --predictions {output}
        """
```

**Explanation**:
- The `snakefile` contains 3 **rules**. Snakemake runs the "all" rule (the first
  rule in the file) by default, unless we ask it to produce a different "target":
  - "all"
  - "chart"
  - "predictions"
- Rules "predictions" and "chart" depend on **input** and produce **output**.
- Note how "all" depends on the output of the "chart" rule and how "chart" depends
  on the output of the "predictions" rule.
- The **shell** part of the rule shows how to produce the output from the
  input.
- We ask Snakemake to collect all files that match `"data/train_{number}.csv"`
  and from this to infer the **wildcard** `{number}`.
- Later we can refer to `{number}` throughout the `snakefile`.
- This part defines what we want to have at the end:
  ```
  rule all:
      input:
          expand("results/chart_{number}.svg", number=numbers)
  ```
- Rules correspond to steps, and parameter scanning can be done with wildcards
  (a way to preview what Snakemake plans to run is sketched below).

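
Before running the workflow for real, it can help to preview what Snakemake would do.
The two commands below are a minimal sketch: `--dry-run` (or `-n`) only prints the
planned jobs, and `--dag` prints the dependency graph, here assuming Graphviz (`dot`)
is available to render it:
```console
$ snakemake --cores 8 --dry-run
$ snakemake --dag | dot -Tsvg > dag.svg
```

The graph in particular makes it visible that the "chart" job for one `{number}` only
waits for its own "predictions" job, which is what allows Snakemake to run the
different numbers in parallel.
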
## Exercise

::::{exercise} Exercise: Practicing with Snakemake
1. Create a `snakefile` containing the rules above and run it with Snakemake (adjust
   the number of cores):
   ```console
   $ snakemake --cores 8
   ```

1. Check the output. Did it use all available cores? How did it know
   which steps it can start in parallel?

1. Run Snakemake again. Now it should finish almost immediately because all the
   results are already there. Aha! Snakemake does not repeat steps that are
   already done. You can force a full re-run by first deleting all outputs with
   `snakemake --delete-all-output` and then running the workflow again.

1. Remove a few files from `results/` and run it again.
   Snakemake should now only re-run the steps that are necessary to get the
   deleted files (a way to check this in advance is sketched after the exercise).

1. Modify `generate_predictions.py`, which is used in the rule "predictions".
   Now Snakemake will only re-run the steps that depend on this script.

1. It is possible to process only one file (useful for testing):
   ```console
   $ snakemake --cores 1 results/predictions_09.csv
   ```

1. Add a few more data files to the input directory `data/` (for instance by
   copying some existing ones) and observe how Snakemake picks them up the next
   time you run the workflow, without changing the `snakefile`.
::::

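
A quick way to check in advance which jobs Snakemake considers out of date after
deleting results or editing a script is another dry run; the file name below is just
an assumed example:
```console
$ rm results/chart_03.svg
$ snakemake --cores 8 --dry-run
```

Only the jobs needed to recreate missing or outdated files should show up in the plan.
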
## What else is possible?

- With the option `--keep-going` we can tell Snakemake to not give up on the first failure.

- The option `--restart-times 3` would tell Snakemake to try to restart a rule
  up to 3 times if it fails.

- It is possible to tell Snakemake to use a **locally mounted file system** instead
  of the default network file system for certain rules
  ([documentation](https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/fs.html)).

- Sometimes you need to run different rules inside different software
  environments (e.g. **Conda environments**) and this is possible (a minimal
  sketch follows after this list).

- A lot **more is possible**:
  - [Snakemake documentation](https://snakemake.readthedocs.io)
  - [Snakemake tutorial](https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html)

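
As a sketch of the per-rule environment idea: a rule can point to its own Conda
environment file via the `conda:` directive. The environment file name
`envs/predictions.yaml` is an assumption for this example, not a file from the
pipeline above:
```
rule predictions:
    input:
        script="generate_predictions.py",
        training="data/train_{number}.csv",
        test="data/test_{number}.csv"
    output:
        "results/predictions_{number}.csv"
    # hypothetical environment file describing the packages this rule needs
    conda:
        "envs/predictions.yaml"
    shell:
        """
        python {input.script} --num-neighbors 7 --training-data {input.training} --test-data {input.test} --predictions {output}
        """
```

Depending on the Snakemake version, such environments are activated with
`--use-conda` (older releases) or `--software-deployment-method conda` (newer
releases), and options like `--keep-going` or `--restart-times 3` can be added to
the same command.
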
## More elaborate example

In the example below we also scan over a range of values for the `--num-neighbors` parameter:
:::{code-block}
:emphasize-lines: 7,40-41


# the comma is there because glob_wildcards returns a named tuple
numbers, = glob_wildcards("data/train_{number}.csv")


# define the parameter scan for num-neighbors
neighbor_values = [1, 3, 5, 7, 9, 11]


# rule that collects all target files
rule all:
    input:
        expand("results/chart_{number}_{num_neighbors}.svg", number=numbers, num_neighbors=neighbor_values)


rule chart:
    input:
        script="plot_results.py",
        predictions="results/predictions_{number}_{num_neighbors}.csv",
        training="data/train_{number}.csv"
    output:
        "results/chart_{number}_{num_neighbors}.svg"
    log:
        "logs/chart_{number}_{num_neighbors}.txt"
    shell:
        """
        python {input.script} --training-data {input.training} --predictions {input.predictions} --output-chart {output}
        """


rule predictions:
    input:
        script="generate_predictions.py",
        training="data/train_{number}.csv",
        test="data/test_{number}.csv"
    output:
        "results/predictions_{number}_{num_neighbors}.csv"
    log:
        "logs/predictions_{number}_{num_neighbors}.txt"
    params:
        num_neighbors=lambda wildcards: wildcards.num_neighbors
    shell:
        """
        python {input.script} --num-neighbors {params.num_neighbors} --training-data {input.training} --test-data {input.test} --predictions {output}
        """
:::

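
Even with the parameter scan in place, a single combination can be built by naming
the target file explicitly; the round number and neighbor value below are just an
assumed example:
```console
$ snakemake --cores 1 results/chart_03_5.svg
```

Snakemake fills in both wildcards (`number=03`, `num_neighbors=5`) from the requested
file name.
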
## Other workflow management tools

- Another very popular one is [Nextflow](https://www.nextflow.io/).

- Many workflow management tools exist
  ([overview](https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems)).