Commit 1f9416e

add manifest, add docs, add staging, restructure refcon
1 parent cb94208

12 files changed: +819 −166 lines

README.md — 5 additions, 55 deletions

````diff
@@ -12,67 +12,17 @@ In essence, the following should suffice to get started:
 3. Conda (or mamba): for automatic setup of all software dependencies
    - Note that you can run the workflow w/o Conda, but then all tools listed under `workflow/envs/*.yaml` [not `dev_env.yaml`] need to be in your `$PATH`.
 
-Internal (template) remark: adapt the above if the workflow deployment has additional requirements (e.g., Singularity).
+For a detailed setup guide, please refer to [the workflow documentation](docs/README.md).
+
+**Internal (template) remark**: adapt the above if the workflow deployment has additional requirements (e.g., Singularity).
 
 ## Required input data
 
-Add info here
+Add info here - be concise, and provide more details in [the workflow documentation](docs/README.md).
 
 ## Produced output data
 
-Add info here
-
-# Deploying the workflow on execution hardware
-
-1. run `./init.py` (requires Python3)
-   - this will create an "execution" Conda environment, and a Snakemake working directory plus standard subfolders one level above the repository location
-2. activate the created Conda environment: `cd .. && conda activate ./exec_env`
-3. prepare profile and configuration files as necessary, and run Snakemake
-
-# Developing the workflow locally
-
-1. run `./init.py --dev-only` (requires Python3)
-   - this will skip creating the workflow working directory and subfolders
-2. activate the created Conda environment: `conda activate ./dev_env`
-3. write your code, and add tests to `workflow/snaketests.smk`
-4. run tests:
-   - note that some tests may be expected to fail and may produce error messages
-   - if Snakemake reports a successful pipeline run, then all tests have succeeded irrespective of log messages that look like errors
-   - if you want to test the functions loading reference data from reference containers, you need to build the test container `test_v0.sif` and copy it into the working directory for the workflow test run. Refer to the [reference container repository](https://github.com/core-unit-bioinformatics/reference-container) for build instructions.
-
-```bash
-# test w/o reference container
-snakemake --cores 1 \
-    --config devmode=True \
-    --directory wd/ \
-    --snakefile workflow/snaketests.smk
-
-# test w/ reference container
-# the container 'test_v0.sif' must exist
-# in the working directory: 'wd/test_v0.sif'
-snakemake --cores 1 \
-    --config devmode=True \
-    --directory wd/ \
-    --configfiles config/testing/params_refcon.yaml \
-    --snakefile workflow/snaketests.smk
-```
-
-5. run the recommended code checks with the following tools:
-   - Python scripts:
-     - linting: `pylint <script.py>`
-     - organize imports: `isort <script.py>`
-     - code formatting: `black [--check] .`
-   - Snakemake files:
-     - linting: `snakemake --config devmode=True --lint`
-     - code formatting: `snakefmt [--check] .`
-6. after checking and standardizing your code, commit and push your changes
+Add info here - be concise, and provide more details in [the workflow documentation](docs/README.md).
 
 # Citation
````

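The deployment steps moved out of the README describe `./init.py` creating a Snakemake working directory with standard subfolders one level above the repository. The script itself is not part of this page; below is a minimal sketch of just the directory-scaffolding step, assuming the subfolder names documented in docs/README.md and a working directory named `wd` (the Conda environment creation is omitted):

```python
from pathlib import Path

# Subfolder names taken from docs/README.md in this commit; the real
# init.py (not shown here) may create more or fewer.
WD_SUBFOLDERS = ["results", "proc", "log", "rsrc", "global_ref", "local_ref"]


def scaffold_working_dir(repo_root: Path) -> Path:
    """Create a Snakemake working directory next to the repository."""
    # "one level above the repository location", per the deployment docs
    wd = repo_root.parent / "wd"
    for sub in WD_SUBFOLDERS:
        (wd / sub).mkdir(parents=True, exist_ok=True)
    return wd
```

This only illustrates the layout; activating the execution environment and running Snakemake happen in the subsequent steps.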
config/testing/params_refcon.yaml — 1 addition, 1 deletion

```diff
@@ -1,5 +1,5 @@
 
-reference_container_store: "../test_folder/"
+reference_container_store: "."
 reference_container_names:
 - test_v0
 use_reference_container: True
```

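The test config change points `reference_container_store` at the working directory, which matches the testing note elsewhere in this commit that the container must exist at `wd/test_v0.sif`. A minimal sketch of how such a config could be resolved into container file paths; the `.sif` suffix is taken from the `test_v0.sif` file name, and the helper name is hypothetical:

```python
from pathlib import Path


def resolve_container_paths(store, names):
    """Map configured container names onto files in the container store.

    Hypothetical helper: the actual resolution logic lives in the
    workflow's refcon module, which is not shown on this page.
    """
    # '.sif' (Singularity image) matches the 'test_v0.sif' used in testing
    return [Path(store) / f"{name}.sif" for name in names]
```

With the committed test config (`store="."`, `names=["test_v0"]`), this yields `test_v0.sif` directly in the working directory.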
docs/README.md — 124 additions, 0 deletions (new file)

````diff
@@ -0,0 +1,124 @@
+# Documentation for Snakemake workflow NAME HERE
+
+Describe the purpose of the workflow (the big picture)
+
+## General concepts
+
+### Folder structure
+
+By following the guides below, you will end up with a Snakemake working directory (for short: `wd`) with the following subfolders:
+
+`wd/results/`: this folder contains final results and some workflow metadata, and represents the only relevant output folder from the end-user perspective. Accidentally deleting this folder after a successful pipeline run means you have to restart the pipeline, and Snakemake will check which result files have to be recreated.
+
+`wd/proc/`: the `processing` folder contains intermediate files and is not of interest to end users. As a design principle, deleting this folder after successful pipeline execution must not result in the loss of any relevant data.
+
+The folders `wd/log/` (log files), `wd/rsrc/` (resource/benchmark files) and both reference data folders (`wd/global_ref/` and `wd/local_ref/`) should not contain any processed sample data (clean design), and are only relevant for workflow developers.
+
+### Accounting: the file manifest
+
+If properly set up, each workflow automatically creates a result file named `manifest.tsv` (in the folder `wd/results/`). This file lists all (i) input, (ii) reference (from `wd/global_ref/`) and (iii) result files, together with metadata such as file size and data checksums (both MD5 and SHA256). This file is of utmost importance to track which input files, in conjunction with which reference files, were used to produce a certain set of result files. Never ever delete this file.
+
+**Important**: preparing the computation of all checksums etc. needed to complete the file manifest is done during a `--dryrun` of the pipeline. If you are sure that the pipeline will run start to finish, run Snakemake with the option `--dryrun` twice before actually starting the computations. This ensures that all metadata files (checksums etc.) are known to Snakemake when the pipeline run starts, and are created as part of the regular flow of computations.
+
+### Rerunning the exact same workflow
+
+By default, after a successful pipeline run, a complete dump of the workflow configuration is written to a file named `run_config.yaml` (in the folder `wd/results/`). This configuration dump includes which user executed the workflow and which (code) version of the workflow was used. Assuming that the execution infrastructure (i.e., the compute cluster) is the same, it is possible to use just this configuration file to rerun the workflow in the exact same way. Never ever delete this file.
+
+## Documentation for users
+
+If you want to use this workflow as a black box to process your data, simply follow the series of steps below to get things up and running.
+
+### Deploying the workflow on execution hardware
+
+1. run `./init.py` (requires Python3)
+   - this will create an "execution" Conda environment, and a Snakemake working directory plus standard subfolders one level above the repository location
+2. activate the created Conda environment: `cd .. && conda activate ./exec_env`
+3. prepare profile and configuration files as necessary, and run Snakemake
+
+### Detailed output specification
+
+For detailed descriptions, this section should be moved to a separate markdown file.
+
+## Documentation for developers
+
+### Developing the workflow locally
+
+1. run `./init.py --dev-only` (requires Python3)
+   - this will skip creating the workflow working directory and subfolders
+2. activate the created Conda environment: `conda activate ./dev_env`
+3. write your code, and add tests to `workflow/snaketests.smk`
+4. run tests:
+   - note that some tests may be expected to fail and may produce error messages
+   - if Snakemake reports a successful pipeline run, then all tests have succeeded, irrespective of log messages that look like errors
+   - if you want to test the functions loading reference data from reference containers, you need to build the test container `test_v0.sif` and copy it into the working directory for the workflow test run. Refer to the [reference container repository](https://github.com/core-unit-bioinformatics/reference-container) for build instructions.
+
+```bash
+# Example: test w/o reference container
+# Note: execute the workflow first in '--dryrun' mode
+# to trigger (and test) the complete MANIFEST creation
+snakemake --cores 1 \
+    [--dryrun] \
+    --config devmode=True \
+    --directory wd/ \
+    --snakefile workflow/snaketests.smk
+
+# Example: test w/ reference container;
+# the container 'test_v0.sif' must exist
+# in the working directory: 'wd/test_v0.sif'
+# Note: execute the workflow first in '--dryrun' mode
+# to trigger (and test) the complete MANIFEST creation
+snakemake --cores 1 \
+    [--dryrun] \
+    --config devmode=True \
+    --directory wd/ \
+    --configfiles config/testing/params_refcon.yaml \
+    --snakefile workflow/snaketests.smk
+```
+
+5. run the recommended code checks with the following tools:
+   - Python scripts:
+     - linting: `pylint <script.py>`
+     - organize imports: `isort <script.py>`
+     - code formatting: `black [--check] .`
+   - Snakemake files:
+     - linting: `snakemake --config devmode=True --lint`
+     - code formatting: `snakefmt [--check] .`
+6. after checking and standardizing your code, commit and push your changes
````

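The manifest described in the new docs records file size plus MD5 and SHA256 checksums per tracked file. The rules that actually compute it are not part of this page, so here is a minimal, self-contained sketch of what producing one manifest row could look like; the column names are illustrative assumptions, not the committed `manifest.tsv` layout:

```python
import hashlib
from pathlib import Path


def manifest_row(file_path: Path) -> dict:
    """Compute size and both checksums for one file.

    Illustrative only: the real manifest layout is defined by the
    workflow rules, which are not shown in this commit excerpt.
    """
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    # stream in chunks so large result files do not load into memory
    with open(file_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {
        "path": str(file_path),
        "size_bytes": file_path.stat().st_size,
        "md5": md5.hexdigest(),
        "sha256": sha256.hexdigest(),
    }
```

Computing both digests in one pass over the file is why the docs recommend triggering this work during `--dryrun`: the checksums are then already available when the real run starts.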
workflow/Snakefile — 5 additions, 0 deletions

```diff
@@ -1,10 +1,15 @@
+include: "rules/commons/00_commons.smk"
 include: "rules/00_modules.smk"
 
 
 rule run_all:
     input:
         RUN_CONFIG_RELPATH,
+        MANIFEST_RELPATH,
+        # add output of final rule(s) here
+        # to trigger complete run
         [],
+
 
 
 onsuccess:
```

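`rule run_all` now lists `RUN_CONFIG_RELPATH` and `MANIFEST_RELPATH` as targets; per the constants changes in this commit, each such path is built relative to the working directory and given an absolute twin via `resolve()`. A standalone sketch of that pattern, using the directory name from the diff (`results`), outside any Snakemake context:

```python
from pathlib import Path

# relative to the Snakemake working directory, as in 01_constants.smk
DIR_RES = Path("results")

# relative path: what Snakemake rules reference as a target
RUN_CONFIG_RELPATH = DIR_RES.joinpath("run_config.yaml")

# absolute path: resolve() anchors the relative path at the current
# working directory, which for Snakemake is the --directory argument
RUN_CONFIG_ABSPATH = RUN_CONFIG_RELPATH.resolve()
```

This mirrors the change from the old `DIR_WORKING / RUN_CONFIG_RELPATH` construction to `RUN_CONFIG_RELPATH.resolve()`; note that `resolve()` depends on the process working directory, so it only matches `DIR_WORKING` when Snakemake has already changed into it.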
workflow/rules/00_modules.smk — 6 additions, 12 deletions

```diff
@@ -1,12 +1,6 @@
-# some constants built from the Snakemake
-# command line arguments
-include: "commons/01_constants.smk"
-# Module containing generic
-# Python utility functions
-include: "commons/02_pyutils.smk"
-# Module containing generic
-# Snakemake utility rules, e.g., dump_config
-include: "commons/03_smkutils.smk"
-# Module containing
-# reference container location information
-include: "commons/05_refcon.smk"
+"""
+Use this module to list all includes
+required for your pipeline - do not
+add your pipeline-specific modules
+to "commons/00_commons.smk"
+"""
```
Lines changed: 16 additions & 0 deletions (new file)

```diff
@@ -0,0 +1,16 @@
+# some constants built from the Snakemake
+# command line arguments
+include: "01_constants.smk"
+# Module containing generic
+# Python utility functions
+include: "02_pyutils.smk"
+# Module containing generic
+# Snakemake utility rules, e.g., dump_config
+include: "03_smkutils.smk"
+# Module containing
+# reference container location information
+include: "05_refcon.smk"
+# Module performing state/env-altering
+# operations before Snakemake starts its
+# actual work
+include: "09_staging.smk"
```

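One job of the new staging module is hinted at in `01_constants.smk` below: the dry-run option is not accessible as an attribute of the `workflow` object, so `DRYRUN` starts as `None` and has to be recovered later, e.g., from the command line. How `09_staging.smk` actually does this is not shown in this commit; a minimal sketch of one possible approach:

```python
import sys


def detect_dryrun(argv):
    """Heuristically detect Snakemake's dry-run flag in a command line.

    Sketch only: Snakemake accepts -n, --dryrun and --dry-run for this
    option; the staging module may use a different mechanism entirely.
    """
    return any(arg in ("-n", "--dryrun", "--dry-run") for arg in argv)


# e.g., in a staging module: DRYRUN = detect_dryrun(sys.argv)
```

Whatever the mechanism, it must run before the rules execute, which is exactly the "state/env-altering operations before Snakemake starts its actual work" role described for `09_staging.smk`.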
workflow/rules/commons/01_constants.smk — 80 additions, 2 deletions

```diff
@@ -1,4 +1,8 @@
 import pathlib
+import sys
+import enum
+
+WAIT_ACC_LOCK_SECS = config.get("wait_acc_lock_secs", 60)
 
 CPU_LOW = config.get("cpu_low", 2)
 assert isinstance(CPU_LOW, int)
@@ -10,6 +14,11 @@ assert isinstance(CPU_HIGH, int)
 CPU_MAX = config.get("cpu_max", 8)
 assert isinstance(CPU_MAX, int)
 
+# special case: the --dry-run option is not accessible
+# as, e.g., an attribute of the workflow object and has
+# to be extracted later (see 09_staging.smk)
+DRYRUN = None
+
 # stored if needed for logging purposes
 VERBOSE = workflow.verbose
 assert isinstance(VERBOSE, bool)
@@ -45,6 +54,10 @@ WORKDIR = DIR_WORKING
 RUN_IN_DEV_MODE = config.get("devmode", False)
 assert isinstance(RUN_IN_DEV_MODE, bool)
 
+# should the accounting files be reset?
+RESET_ACCOUNTING = config.get("resetacc", False)
+assert isinstance(RESET_ACCOUNTING, bool)
+
 # if the workflow is executed in development mode,
 # the paths underneath the working directory
 # may not exist and that is ok
@@ -87,8 +100,12 @@ WD_RELPATH_LOCAL_REF = WD_ABSPATH_LOCAL_REF.relative_to(DIR_WORKING)
 DIR_LOCAL_REF = WD_RELPATH_LOCAL_REF
 
 # fix name of run config dump file and location
-RUN_CONFIG_RELPATH = DIR_RES / pathlib.Path("run_config.yaml")
-RUN_CONFIG_ABSPATH = DIR_WORKING / RUN_CONFIG_RELPATH
+RUN_CONFIG_RELPATH = DIR_RES.joinpath("run_config.yaml")
+RUN_CONFIG_ABSPATH = RUN_CONFIG_RELPATH.resolve()
+
+# fix name of manifest file and location
+MANIFEST_RELPATH = DIR_RES.joinpath("manifest.tsv")
+MANIFEST_ABSPATH = MANIFEST_RELPATH.resolve()
 
 # specific constants for the use of reference containers
 # as part of the pipeline
@@ -113,3 +130,64 @@ if USE_REFERENCE_CONTAINER:
 else:
     DIR_REFERENCE_CONTAINER = pathlib.Path("/")
     DIR_REFCON = DIR_REFERENCE_CONTAINER
+
+
+ACCOUNTING_FILES = {
+    "inputs": DIR_PROC.joinpath(".accounting", "inputs.listing"),
+    "references": DIR_PROC.joinpath(".accounting", "references.listing"),
+    "results": DIR_PROC.joinpath(".accounting", "results.listing"),
+}
+
+
+class TimeUnit(enum.Enum):
+    HOUR = 1
+    hour = 1
+    hours = 1
+    hrs = 1
+    h = 1
+    MINUTE = 2
+    minute = 2
+    minutes = 2
+    min = 2
+    m = 2
+    SECOND = 3
+    second = 3
+    seconds = 3
+    sec = 3
+    s = 3
+
+
+class MemoryUnit(enum.Enum):
+    BYTE = 0
+    byte = 0
+    bytes = 0
+    b = 0
+    B = 0
+    KiB = 1
+    kib = 1
+    kb = 1
+    KB = 1
+    k = 1
+    K = 1
+    kibibyte = 1
+    MiB = 2
+    mib = 2
+    mb = 2
+    MB = 2
+    m = 2
+    M = 2
+    mebibyte = 2
+    GiB = 3
+    gib = 3
+    gb = 3
+    GB = 3
+    g = 3
+    G = 3
+    gibibyte = 3
+    TiB = 4
+    tib = 4
+    tb = 4
+    TB = 4
+    t = 4
+    T = 4
+    tebibyte = 4
```