fulcrumgenomics · ameynert · Jan 30, 2025 · Jan 30, 2025 · Jan 31, 2025 · Jan 31, 2025
@@ -28,10 +28,74 @@ myworkflow
 │   └── CODEOWNERS
 ├── .gitignore
 ├── README.md
+├── Snakefile
 ├── config
 │   ├── config.yml
 │   └── config_schema.yml
-├── environment.yml
-└── workflow
-    └── myworkflow.smk
+└── environment.yml
+```
+
+## Development
+
+Read the [Snakemake Best Practices](https://snakemake.readthedocs.io/en/stable/snakefiles/best_practices.html) and the [Fulcrum Snakemake](https://www.notion.so/fulcrumgenomics/Snakemake-3d836708c9bc47ca868ee9a09ada7d0d) documentation. The text below is adapted from the latter.
+
+If there is a single Snakemake workflow, it should be named `Snakefile` and kept at the top level of the repository. If there are multiple workflows, name them according to their function and give them the extension `.smk`.
+
+Workflow files should mainly contain `rules`.  Any additional code should be added to a separate Python toolkit, for example to parse the configuration object, handle samplesheet input, or to organize reference data.
+
+Requirements (e.g. executables) should not be specified in the workflows, but should be maintained as part of an environment (e.g. via [Mamba](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html)).
+
+Follow these conventions:
+
+1. All rules should have descriptive names.
+2. All rules should have a short docstring describing what the rule does, and what tool(s) it uses. If there are any custom input or output file formats, describe them, e.g. for a CSV/TSV file list the expected field names.
+3. All rules should have the following directives:
+    - `input`: the input paths
+    - `output`: the output paths
+    - `log`: the path to the log file. Good practice is to include the workflow name, rule name, and any wildcards in the log file name, e.g. `logs/{workflow_name}.{rule_name}.{wildcard1}.{wildcard2}.log`.
+4. The following directives are optional, but recommended when known:
+    - `params`: any custom metadata, both static and conditional on the wildcards
+    - `threads`: the number of threads to use
+    - `resources`: specifies custom resources with the following keywords:
+        - `mem_gb`: the amount of memory to allocate, in gigabytes (e.g. `8` for eight gigabytes)
+5. Inputs, outputs, and parameters should always be provided with keywords, and should always be referenced within the `shell` block by keyword (never positionally).
+
+```python
+rule:
+    input:
+        bam="/path/to/bam"
+```
+
+6. Shell commands should not contain references to global variables, but only rule directives. Data needed to build the command should be stored in the `params` data structure.
+7. Both standard output and standard error should be redirected to the log file.
+
+```python
+rule:
+    ...
+    shell:
+        """
+        (
+            echo "Hello Zorld" | sed -e 's_Z_W_'
+        ) &> {log}
+        """
+```
+
+8. Use `ALL_CAPS` for naming global variables. This will help distinguish them from local variables and parameters when used in input functions.
+9. Simplify inputs with the use of [Snakemake helper functions](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#helpers-for-defining-rules) e.g. `expand` and `multiext`.
+10. Use [input functions](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#input-functions) where appropriate.
+11. Be consistent with the separator character you use in file names.
+
+### Configuration
+
+Snakemake supports [configuration and validation of workflow parameters](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html).
+
+To ensure valid inputs to your workflow, use a [configuration schema](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#validation).
+
+This template includes an example [configuration schema](config/config_schema.yml) and [configuration file](config/config.yml) to get you started.
+The [workflow](Snakefile) looks for the configuration schema to validate the configuration file:
+
+```python
+from snakemake.utils import validate
+
+validate(config, workflow.basedir + "/config/config_schema.yml")
 ```
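Guideline 3 in the hunk above recommends building log file names from the workflow name, the rule name, and any wildcards. As a minimal sketch of that convention, the hypothetical helper `log_pattern` below (not part of Snakemake or of this template) assembles such a pattern in plain Python; the workflow name, rule name, and wildcard names in the usage lines are made up for illustration.

```python
def log_pattern(workflow_name: str, rule_name: str, *wildcard_names: str) -> str:
    """Build a log path pattern like logs/<workflow>.<rule>.{wc1}.{wc2}.log.

    Hypothetical helper for illustration only. Wildcard names are kept in
    braces so that Snakemake can substitute their values for each job.
    """
    placeholders = ["{" + name + "}" for name in wildcard_names]
    return "logs/" + ".".join([workflow_name, rule_name, *placeholders]) + ".log"


# Made-up example: a rule "align" in workflow "myworkflow" that uses the
# wildcards "sample" and "read".
ALIGN_LOG = log_pattern("myworkflow", "align", "sample", "read")
print(ALIGN_LOG)
```

The resulting string could be used as the value of a rule's `log` directive, which keeps log names uniform across rules.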
@@ -12,3 +12,20 @@ To install and activate:
 mamba env create -f environment.yml
 mamba activate {{cookiecutter.project_slug}}
 ```
+
+## Configure and run the workflow
+
+The [workflow configuration schema](config/config_schema.yml) describes the parameters for the workflow.
+To set the parameters for a specific run of the workflow, write a [configuration file](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html) using the schema and the [example config/config.yml](config/config.yml) as a guide.
+
+If you do not specify a workflow file with e.g. `-s myworkflow.smk`, Snakemake will look, in this order, for a file called `Snakefile`, `snakefile`, `workflow/Snakefile`, or `workflow/snakefile`.
+
+```console
+snakemake -j12 --configfile config/config.yml
+```
+
+You can override specific values on the command line with the `--config` option.
+
+```console
+snakemake -j12 --configfile config/config.yml --config experiment=myexperiment
+```
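The override behaviour described in this hunk, where values given with `--config` take precedence over those loaded with `--configfile`, can be pictured as a dictionary update. The sketch below is a simplified illustration for top-level keys only and is not Snakemake's own implementation; `merge_config` is a hypothetical name.

```python
def merge_config(file_config: dict, cli_overrides: dict) -> dict:
    """Return a copy of file_config with command-line overrides applied.

    Hypothetical helper: only top-level keys are handled here.
    """
    merged = dict(file_config)    # start from the values in the config file
    merged.update(cli_overrides)  # command-line values win on conflict
    return merged


# Values as in config/config.yml, plus the override from the command above.
file_config = {"experiment": "experiment1", "samples": ["sample1", "sample2"]}
cli_overrides = {"experiment": "myexperiment"}

merged = merge_config(file_config, cli_overrides)
print(merged["experiment"])
print(merged["samples"])
```

Keys that are not overridden, such as `samples` here, keep the values from the configuration file.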
@@ -0,0 +1,50 @@
+################################################################################
+# Pipeline for {{cookiecutter.project_slug}}
+################################################################################
+
+from pathlib import Path
+
+from fgsmk.log import on_error
+from snakemake.utils import validate
+
+
+################################################################################
+# Utility methods and variables
+################################################################################
+
+validate(config, workflow.basedir + "/config/config_schema.yml")
+
+onerror:
+    """Block of code that gets called if the snakemake pipeline exits with an error."""
+    on_error(snakefile=Path(__file__), config=config, log=Path(log))
+
+################################################################################
+# Snakemake rules
+################################################################################
+
+SAMPLES = config["samples"]
+READS = ["1", "2"]
+
+
+rule all:
+    input:
+        "data/resources/ref.fa",
+        expand("data/raw/{sample}_R{read}.fastq.gz", sample=SAMPLES, read=READS),
+
+
+rule download_raw_data:
+    output:
+        "data/raw/{sample}_R{read}.fastq.gz",
+    shell:
+        """
+        # wget -O {output} https://to/data/{wildcards.sample}_R{wildcards.read}.fastq.gz
+        touch {output}
+        """
+
+
+rule download_resource_data:
+    output:
+        "data/resources/ref.fa",
+    shell:
+        """
+        # wget -O {output} https://to/data/ref.fa
+        touch {output}
+        """
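For readers new to `expand`, the `rule all` input in this Snakefile requests one FASTQ path per combination of sample and read. The plain-Python sketch below approximates that behaviour with `itertools.product`; `expand_sketch` is a hypothetical stand-in written for illustration, not Snakemake's function, and the sample names are taken from the example configuration file.

```python
from itertools import product


def expand_sketch(pattern: str, **wildcards: list) -> list:
    """Fill the pattern with every combination of the given wildcard values.

    Hypothetical stand-in that mimics the all-combinations use of expand.
    """
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*wildcards.values())
    ]


SAMPLES = ["sample1", "sample2"]
READS = ["1", "2"]

targets = expand_sketch(
    "data/raw/{sample}_R{read}.fastq.gz", sample=SAMPLES, read=READS
)
for target in targets:
    print(target)
```

With two samples and two reads this yields four target paths, each of which matches the output pattern of `download_raw_data`.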
@@ -0,0 +1,6 @@
+################################################################################
+# Configuration file for {{cookiecutter.project_slug}}
+################################################################################
+
+experiment: "experiment1"
+samples: ["sample1", "sample2"]
@@ -0,0 +1,24 @@
+$schema: "https://json-schema.org/draft/2020-12/schema"
+description: Config schema for {{cookiecutter.project_slug}}
+
+type: object
+
+properties:
+  experiment:
+    type: string
+    description: Name of the experiment.
+    example: "{{cookiecutter.project_slug}}"
+
+  samples:
+    type: array
+    description: List of samples.
+    example: ["sample1", "sample2"]
+
+  p_value_cutoff:
+    type: number
+    description: P-value cutoff for statistical significance.
+    default: 0.05
+
+required:
+  - experiment
+  - samples
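To make concrete what this schema asks of a configuration, the sketch below hand-rolls the same checks in plain Python: required keys, basic types, and the default for `p_value_cutoff`. It is a simplified stand-in, not how `snakemake.utils.validate` works internally (that function relies on JSON Schema validation); `SCHEMA_SKETCH` and `check_config` are hypothetical names, and the schema is transcribed by hand into Python types.

```python
# Hand-transcribed, simplified view of config_schema.yml (an assumption made
# for illustration; the real schema is the YAML file in this diff).
SCHEMA_SKETCH = {
    "properties": {
        "experiment": {"type": str},
        "samples": {"type": list},
        "p_value_cutoff": {"type": (int, float), "default": 0.05},
    },
    "required": ["experiment", "samples"],
}


def check_config(config: dict, schema: dict = SCHEMA_SKETCH) -> dict:
    """Return a checked copy of config with schema defaults filled in."""
    missing = [key for key in schema["required"] if key not in config]
    if missing:
        raise ValueError(f"missing required keys: {missing}")

    checked = dict(config)
    for key, spec in schema["properties"].items():
        if key not in checked:
            # Optional key that is absent: use the default if one is declared.
            if "default" in spec:
                checked[key] = spec["default"]
            continue
        if not isinstance(checked[key], spec["type"]):
            raise ValueError(f"wrong type for {key!r}: {type(checked[key]).__name__}")
    return checked


config = check_config({"experiment": "experiment1", "samples": ["sample1", "sample2"]})
print(config["p_value_cutoff"])
```

A configuration without `samples`, for example, would be rejected with a `ValueError` before any rule runs, which is the point of validating up front.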