Commit 1f9416e

add manifest, add docs, add staging, restructure refcon
1 parent cb94208

12 files changed: +819 −166 lines

README.md — 5 additions, 55 deletions

````diff
@@ -12,67 +12,17 @@ In essence, the following should suffice to get started:
 3. Conda (or mamba): for automatic setup of all software dependencies
    - Note that you can run the workflow w/o Conda, but then all tools listed under `workflow/envs/*.yaml` [not `dev_env.yaml`] need to be in your `$PATH`.
 
-Internal (template) remark: adapt the above if the workflow deployment has additional requirements (e.g., Singularity).
+For a detailed setup guide, please refer to [the workflow documentation](docs/README.md).
+
+**Internal (template) remark**: adapt the above if the workflow deployment has additional requirements (e.g., Singularity).
 
 ## Required input data
 
-Add info here
+Add info here - be concise, and provide more details in [the workflow documentation](docs/README.md).
 
 ## Produced output data
 
-Add info here
-
-# Deploying the workflow on execution hardware
-
-1. run `./init.py` (requires Python3)
-   - this will create an "execution" Conda environment, and a Snakemake working directory plus standard subfolders one level above the repository location
-2. activate the created Conda environment: `cd .. && conda activate ./exec_env`
-3. prepare profile and configuration files as necessary, and run Snakemake
-
-# Developing the workflow locally
-
-1. run `./init.py --dev-only` (requires Python3)
-   - this will skip creating the workflow working directory and subfolders
-2. activate the created Conda environment: `conda activate ./dev_env`
-3. write your code, and add tests to `workflow/snaketests.smk`
-4. run tests:
-   - note that some tests may be expected to fail and may produce error messages
-   - if Snakemake reports a successful pipeline run, then all tests have succeeded irrespective of log messages that look like errors
-   - if you want to test the functions loading reference data from reference containers, you need to build the test container `test_v0.sif` and copy it into the working directory for the workflow test run. Refer to the [reference container repository](https://github.com/core-unit-bioinformatics/reference-container) for build instructions.
-
-```bash
-# test w/o reference container
-snakemake --cores 1 \
-    --config devmode=True \
-    --directory wd/ \
-    --snakefile workflow/snaketests.smk
-
-# test w/ reference container
-# the container 'test_v0.sif' must exist
-# in the working directory: 'wd/test_v0.sif'
-snakemake --cores 1 \
-    --config devmode=True \
-    --directory wd/ \
-    --configfiles config/testing/params_refcon.yaml \
-    --snakefile workflow/snaketests.smk
-```
-
-5. run the recommended code checks with the following tools:
-   - Python scripts:
-     - linting: `pylint <script.py>`
-     - organize imports: `isort <script.py>`
-     - code formatting: `black [--check] .`
-   - Snakemake files:
-     - linting: `snakemake --config devmode=True --lint`
-     - code formatting: `snakefmt [--check] .`
-6. after checking and standardizing your code, commit and push your changes
+Add info here - be concise, and provide more details in [the workflow documentation](docs/README.md).
 
 # Citation
````

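The deployment steps moved out of the README describe `./init.py` creating a Snakemake working directory with standard subfolders one level above the repository. The script itself is not part of this page; below is a minimal sketch of just the directory-scaffolding step, assuming the subfolder names documented in docs/README.md and a working directory named `wd` (the Conda environment creation is omitted):

```python
from pathlib import Path

# Subfolder names taken from docs/README.md in this commit; the real
# init.py (not shown here) may create more or fewer.
WD_SUBFOLDERS = ["results", "proc", "log", "rsrc", "global_ref", "local_ref"]


def scaffold_working_dir(repo_root: Path) -> Path:
    """Create a Snakemake working directory next to the repository."""
    # "one level above the repository location", per the deployment docs
    wd = repo_root.parent / "wd"
    for sub in WD_SUBFOLDERS:
        (wd / sub).mkdir(parents=True, exist_ok=True)
    return wd
```

This only illustrates the layout; activating the execution environment and running Snakemake happen in the subsequent steps.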
config/testing/params_refcon.yaml — 1 addition, 1 deletion

```diff
@@ -1,5 +1,5 @@
 
-reference_container_store: "../test_folder/"
+reference_container_store: "."
 reference_container_names:
 - test_v0
 use_reference_container: True
```

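The test config change points `reference_container_store` at the working directory, which matches the testing note elsewhere in this commit that the container must exist at `wd/test_v0.sif`. A minimal sketch of how such a config could be resolved into container file paths; the `.sif` suffix is taken from the `test_v0.sif` file name, and the helper name is hypothetical:

```python
from pathlib import Path


def resolve_container_paths(store, names):
    """Map configured container names onto files in the container store.

    Hypothetical helper: the actual resolution logic lives in the
    workflow's refcon module, which is not shown on this page.
    """
    # '.sif' (Singularity image) matches the 'test_v0.sif' used in testing
    return [Path(store) / f"{name}.sif" for name in names]
```

With the committed test config (`store="."`, `names=["test_v0"]`), this yields `test_v0.sif` directly in the working directory.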
docs/README.md — 124 additions, 0 deletions (new file)

````diff
@@ -0,0 +1,124 @@
+# Documentation for Snakemake workflow NAME HERE
+
+Describe the purpose of the workflow (the big picture)
+
+## General concepts
+
+### Folder structure
+
+By following the guides below, you will end up with a Snakemake working directory (for short: `wd`) with the following subfolders:
+
+`wd/results/`: this folder contains final results and some workflow metadata, and represents the only relevant output folder from the end-user perspective. Accidentally deleting this folder after a successful pipeline run means you have to restart the pipeline, and Snakemake will check which result files have to be recreated.
+
+`wd/proc/`: the `processing` folder contains intermediate files and is not of interest to end users. As a design principle, deleting this folder after successful pipeline execution must not result in the loss of any relevant data.
+
+The folders `wd/log/` (log files), `wd/rsrc/` (resource/benchmark files) and both reference data folders (`wd/global_ref/` and `wd/local_ref/`) should not contain any processed sample data (clean design), and are only relevant for workflow developers.
+
+### Accounting: the file manifest
+
+If properly set up, each workflow automatically creates a result file named `manifest.tsv` (in the folder `wd/results/`). This file lists all (i) input, (ii) reference (from `wd/global_ref/`) and (iii) result files, together with metadata such as file size and data checksums (both MD5 and SHA256). This file is of utmost importance to track which input files, in conjunction with which reference files, were used to produce a certain set of result files. Never ever delete this file.
+
+**Important**: preparing the computation of all checksums etc. needed to complete the file manifest is done during a `--dryrun` of the pipeline. If you are sure that the pipeline will run start to finish, run Snakemake with the option `--dryrun` twice before actually starting the computations. This ensures that all metadata files (checksums etc.) are known to Snakemake when the pipeline run starts, and are created as part of the regular flow of computations.
+
+### Rerunning the exact same workflow
+
+By default, after a successful pipeline run, a complete dump of the workflow configuration is written to a file named `run_config.yaml` (in the folder `wd/results/`). This configuration dump includes which user executed the workflow and which (code) version of the workflow was used. Assuming that the execution infrastructure (i.e., the compute cluster) is the same, it is possible to use just this configuration file to rerun the workflow in the exact same way. Never ever delete this file.
+
+## Documentation for users
+
+If you want to use this workflow as a black box to process your data, simply follow the series of steps below to get things up and running.
+
+### Deploying the workflow on execution hardware
+
+1. run `./init.py` (requires Python3)
+   - this will create an "execution" Conda environment, and a Snakemake working directory plus standard subfolders one level above the repository location
+2. activate the created Conda environment: `cd .. && conda activate ./exec_env`
+3. prepare profile and configuration files as necessary, and run Snakemake
+
+### Detailed output specification
+
+For detailed descriptions, this section should be moved to a separate markdown file.
+
+## Documentation for developers
+
+### Developing the workflow locally
+
+1. run `./init.py --dev-only` (requires Python3)
+   - this will skip creating the workflow working directory and subfolders
+2. activate the created Conda environment: `conda activate ./dev_env`
+3. write your code, and add tests to `workflow/snaketests.smk`
+4. run tests:
+   - note that some tests may be expected to fail and may produce error messages
+   - if Snakemake reports a successful pipeline run, then all tests have succeeded, irrespective of log messages that look like errors
+   - if you want to test the functions loading reference data from reference containers, you need to build the test container `test_v0.sif` and copy it into the working directory for the workflow test run. Refer to the [reference container repository](https://github.com/core-unit-bioinformatics/reference-container) for build instructions.
+
+```bash
+# Example: test w/o reference container
+# Note: execute the workflow first in '--dryrun' mode
+# to trigger (and test) the complete MANIFEST creation
+snakemake --cores 1 \
+    [--dryrun] \
+    --config devmode=True \
+    --directory wd/ \
+    --snakefile workflow/snaketests.smk
+
+# Example: test w/ reference container;
+# the container 'test_v0.sif' must exist
+# in the working directory: 'wd/test_v0.sif'
+# Note: execute the workflow first in '--dryrun' mode
+# to trigger (and test) the complete MANIFEST creation
+snakemake --cores 1 \
+    [--dryrun] \
+    --config devmode=True \
+    --directory wd/ \
+    --configfiles config/testing/params_refcon.yaml \
+    --snakefile workflow/snaketests.smk
+```
+
+5. run the recommended code checks with the following tools:
+   - Python scripts:
+     - linting: `pylint <script.py>`
+     - organize imports: `isort <script.py>`
+     - code formatting: `black [--check] .`
+   - Snakemake files:
+     - linting: `snakemake --config devmode=True --lint`
+     - code formatting: `snakefmt [--check] .`
+6. after checking and standardizing your code, commit and push your changes
````

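The manifest described in the new docs records file size plus MD5 and SHA256 checksums per tracked file. The rules that actually compute it are not part of this page, so here is a minimal, self-contained sketch of what producing one manifest row could look like; the column names are illustrative assumptions, not the committed `manifest.tsv` layout:

```python
import hashlib
from pathlib import Path


def manifest_row(file_path: Path) -> dict:
    """Compute size and both checksums for one file.

    Illustrative only: the real manifest layout is defined by the
    workflow rules, which are not shown in this commit excerpt.
    """
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    # stream in chunks so large result files do not load into memory
    with open(file_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {
        "path": str(file_path),
        "size_bytes": file_path.stat().st_size,
        "md5": md5.hexdigest(),
        "sha256": sha256.hexdigest(),
    }
```

Computing both digests in one pass over the file is why the docs recommend triggering this work during `--dryrun`: the checksums are then already available when the real run starts.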
workflow/Snakefile — 5 additions, 0 deletions

```diff
@@ -1,10 +1,15 @@
+include: "rules/commons/00_commons.smk"
 include: "rules/00_modules.smk"
 
 
 rule run_all:
     input:
         RUN_CONFIG_RELPATH,
+        MANIFEST_RELPATH,
+        # add output of final rule(s) here
+        # to trigger complete run
         [],
+
 
 
 onsuccess:
```

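`rule run_all` now lists `RUN_CONFIG_RELPATH` and `MANIFEST_RELPATH` as targets; per the constants changes in this commit, each such path is built relative to the working directory and given an absolute twin via `resolve()`. A standalone sketch of that pattern, using the directory name from the diff (`results`), outside any Snakemake context:

```python
from pathlib import Path

# relative to the Snakemake working directory, as in 01_constants.smk
DIR_RES = Path("results")

# relative path: what Snakemake rules reference as a target
RUN_CONFIG_RELPATH = DIR_RES.joinpath("run_config.yaml")

# absolute path: resolve() anchors the relative path at the current
# working directory, which for Snakemake is the --directory argument
RUN_CONFIG_ABSPATH = RUN_CONFIG_RELPATH.resolve()
```

This mirrors the change from the old `DIR_WORKING / RUN_CONFIG_RELPATH` construction to `RUN_CONFIG_RELPATH.resolve()`; note that `resolve()` depends on the process working directory, so it only matches `DIR_WORKING` when Snakemake has already changed into it.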
workflow/rules/00_modules.smk — 6 additions, 12 deletions

```diff
@@ -1,12 +1,6 @@
-# some constants built from the Snakemake
-# command line arguments
-include: "commons/01_constants.smk"
-# Module containing generic
-# Python utility functions
-include: "commons/02_pyutils.smk"
-# Module containing generic
-# Snakemake utility rules, e.g., dump_config
-include: "commons/03_smkutils.smk"
-# Module containing
-# reference container location information
-include: "commons/05_refcon.smk"
+"""
+Use this module to list all includes
+required for your pipeline - do not
+add your pipeline-specific modules
+to "commons/00_commons.smk"
+"""
```
Lines changed: 16 additions & 0 deletions (new file)

```diff
@@ -0,0 +1,16 @@
+# some constants built from the Snakemake
+# command line arguments
+include: "01_constants.smk"
+# Module containing generic
+# Python utility functions
+include: "02_pyutils.smk"
+# Module containing generic
+# Snakemake utility rules, e.g., dump_config
+include: "03_smkutils.smk"
+# Module containing
+# reference container location information
+include: "05_refcon.smk"
+# Module performing state/env-altering
+# operations before Snakemake starts its
+# actual work
+include: "09_staging.smk"
```

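One job of the new staging module is hinted at in `01_constants.smk` below: the dry-run option is not accessible as an attribute of the `workflow` object, so `DRYRUN` starts as `None` and has to be recovered later, e.g., from the command line. How `09_staging.smk` actually does this is not shown in this commit; a minimal sketch of one possible approach:

```python
import sys


def detect_dryrun(argv):
    """Heuristically detect Snakemake's dry-run flag in a command line.

    Sketch only: Snakemake accepts -n, --dryrun and --dry-run for this
    option; the staging module may use a different mechanism entirely.
    """
    return any(arg in ("-n", "--dryrun", "--dry-run") for arg in argv)


# e.g., in a staging module: DRYRUN = detect_dryrun(sys.argv)
```

Whatever the mechanism, it must run before the rules execute, which is exactly the "state/env-altering operations before Snakemake starts its actual work" role described for `09_staging.smk`.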
workflow/rules/commons/01_constants.smk — 80 additions, 2 deletions

```diff
@@ -1,4 +1,8 @@
 import pathlib
+import sys
+import enum
+
+WAIT_ACC_LOCK_SECS = config.get("wait_acc_lock_secs", 60)
 
 CPU_LOW = config.get("cpu_low", 2)
 assert isinstance(CPU_LOW, int)
@@ -10,6 +14,11 @@ assert isinstance(CPU_HIGH, int)
 CPU_MAX = config.get("cpu_max", 8)
 assert isinstance(CPU_MAX, int)
 
+# special case: the --dry-run option is not accessible
+# as, e.g., an attribute of the workflow object and has
+# to be extracted later (see 09_staging.smk)
+DRYRUN = None
+
 # stored if needed for logging purposes
 VERBOSE = workflow.verbose
 assert isinstance(VERBOSE, bool)
@@ -45,6 +54,10 @@ WORKDIR = DIR_WORKING
 RUN_IN_DEV_MODE = config.get("devmode", False)
 assert isinstance(RUN_IN_DEV_MODE, bool)
 
+# should the accounting files be reset?
+RESET_ACCOUNTING = config.get("resetacc", False)
+assert isinstance(RESET_ACCOUNTING, bool)
+
 # if the workflow is executed in development mode,
 # the paths underneath the working directory
 # may not exist and that is ok
@@ -87,8 +100,12 @@ WD_RELPATH_LOCAL_REF = WD_ABSPATH_LOCAL_REF.relative_to(DIR_WORKING)
 DIR_LOCAL_REF = WD_RELPATH_LOCAL_REF
 
 # fix name of run config dump file and location
-RUN_CONFIG_RELPATH = DIR_RES / pathlib.Path("run_config.yaml")
-RUN_CONFIG_ABSPATH = DIR_WORKING / RUN_CONFIG_RELPATH
+RUN_CONFIG_RELPATH = DIR_RES.joinpath("run_config.yaml")
+RUN_CONFIG_ABSPATH = RUN_CONFIG_RELPATH.resolve()
+
+# fix name of manifest file and location
+MANIFEST_RELPATH = DIR_RES.joinpath("manifest.tsv")
+MANIFEST_ABSPATH = MANIFEST_RELPATH.resolve()
 
 # specific constants for the use of reference containers
 # as part of the pipeline
@@ -113,3 +130,64 @@ if USE_REFERENCE_CONTAINER:
 else:
     DIR_REFERENCE_CONTAINER = pathlib.Path("/")
     DIR_REFCON = DIR_REFERENCE_CONTAINER
+
+
+ACCOUNTING_FILES = {
+    "inputs": DIR_PROC.joinpath(".accounting", "inputs.listing"),
+    "references": DIR_PROC.joinpath(".accounting", "references.listing"),
+    "results": DIR_PROC.joinpath(".accounting", "results.listing"),
+}
+
+
+class TimeUnit(enum.Enum):
+    HOUR = 1
+    hour = 1
+    hours = 1
+    hrs = 1
+    h = 1
+    MINUTE = 2
+    minute = 2
+    minutes = 2
+    min = 2
+    m = 2
+    SECOND = 3
+    second = 3
+    seconds = 3
+    sec = 3
+    s = 3
+
+
+class MemoryUnit(enum.Enum):
+    BYTE = 0
+    byte = 0
+    bytes = 0
+    b = 0
+    B = 0
+    KiB = 1
+    kib = 1
+    kb = 1
+    KB = 1
+    k = 1
+    K = 1
+    kibibyte = 1
+    MiB = 2
+    mib = 2
+    mb = 2
+    MB = 2
+    m = 2
+    M = 2
+    mebibyte = 2
+    GiB = 3
+    gib = 3
+    gb = 3
+    GB = 3
+    g = 3
+    G = 3
+    gibibyte = 3
+    TiB = 4
+    tib = 4
+    tb = 4
+    TB = 4
+    t = 4
+    T = 4
+    tebibyte = 4
```