feat: workflow and config templates (#29)

ameynert · web-flow · commit a0c2f2537b7b · 2025-03-14T16:22:57.000-07:00
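
The config validation this commit wires into the Snakefile can be illustrated outside Snakemake. The sketch below is a minimal, standard-library-only approximation, not Snakemake's implementation: `SCHEMA` is a hand-copied dict version of `config/config_schema.yml`, and `PY_TYPES` and `check_config` are hypothetical names introduced here for illustration. It only checks required keys and top-level types, and fills in `p_value_cutoff` from its default.

```python
# Hypothetical stand-in for config/config_schema.yml, written as a plain dict.
SCHEMA = {
    "properties": {
        "experiment": {"type": "string"},
        "reference": {"type": "string"},
        "samples": {"type": "array"},
        "p_value_cutoff": {"type": "number", "default": 0.05},
    },
    "required": ["experiment", "samples", "reference"],
}

# JSON Schema type names mapped to Python types (only the ones used above).
PY_TYPES = {"string": str, "array": list, "number": (int, float)}


def check_config(config: dict, schema: dict) -> dict:
    """Return a copy of `config` with schema defaults filled in.

    Raises ValueError if a required key is missing or a value has the
    wrong top-level type. Assumption: this covers only a small subset of
    JSON Schema, enough to mirror the template's example schema.
    """
    missing = [key for key in schema["required"] if key not in config]
    if missing:
        raise ValueError(f"missing required keys: {missing}")

    checked = dict(config)  # shallow copy so the caller's dict is untouched
    for key, spec in schema["properties"].items():
        if key not in checked:
            if "default" in spec:
                checked[key] = spec["default"]
            continue
        value = checked[key]
        # bool is a subclass of int in Python; none of the types in this
        # schema accept booleans, so reject them outright.
        if isinstance(value, bool) or not isinstance(value, PY_TYPES[spec["type"]]):
            raise ValueError(
                f"{key!r} should be of type {spec['type']}, "
                f"got {type(value).__name__}"
            )
    return checked


config = {
    "experiment": "experiment1",
    "samples": ["sample1", "sample2"],
    "reference": "https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
}
print(check_config(config, SCHEMA)["p_value_cutoff"])  # prints 0.05
```

In the generated project, `validate(config, ...)` from `snakemake.utils` does this work against the real schema file, so the helper above is only a reading aid.
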
diff --git a/README.md b/README.md
@@ -21,17 +21,30 @@ Install [`cookiecutter`](https://cookiecutter.readthedocs.io/en/stable/) and run
 You will have a Git-initialized Snakemake project at the following location:
 
 ```console
-❯ > tree -a myworkflow 
+❯ tree -a myworkflow 
 myworkflow
 ├── .git # contents omitted for brevity
 ├── .github
 │   └── CODEOWNERS
 ├── .gitignore
 ├── README.md
+├── Snakefile
 ├── config
 │   ├── config.yml
 │   └── config_schema.yml
-├── environment.yml
-└── workflow
-    └── myworkflow.smk
+└── environment.yml
 ```
+
+## Development
+
+Read the [Snakemake Best Practices](https://snakemake.readthedocs.io/en/stable/snakefiles/best_practices.html) and the [Fulcrum Snakemake](https://www.notion.so/fulcrumgenomics/Snakemake-3d836708c9bc47ca868ee9a09ada7d0d) documentation.
+
+### Configuration
+
+[Snakemake supports configuration and validation of workflow parameters](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html).
+
+To ensure valid inputs to your workflow, use a [configuration schema](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#validation).
+
+This template includes an example [configuration schema]({{cookiecutter.project_slug}}/config/config_schema.yml) and [configuration file]({{cookiecutter.project_slug}}/config/config.yml) to get you started.
+
+At runtime, the [workflow]({{cookiecutter.project_slug}}/Snakefile) [validates the provided configuration](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#validation) against the defined JSON schema.
diff --git a/{{cookiecutter.project_slug}}/README.md b/{{cookiecutter.project_slug}}/README.md
@@ -1,4 +1,49 @@
-# Snakemake workflow with Python toolkit
+# {{cookiecutter.project_slug}}
+
+{{cookiecutter.project_short_description}}
+
+## Inputs
+
+> [!WARNING]
+> **After creating a new project, describe workflow input file formats here.**
+>
+> Use the [configuration schema](config/config_schema.yml) for simple input descriptions. For inputs consumed by third-party tools, link to those tools' input format documentation.
+
+- `reference`: URL to the reference genome FASTA to be downloaded.
+- `experiment`: Name of the experiment.
+- `samples`: List of sample identifiers.
+- `p_value_cutoff`: P-value cutoff for statistical significance (optional, defaults to 0.05).
+
+> [!WARNING]
+> Consider using [Markdown tables](https://www.tablesgenerator.com/markdown_tables) to describe the fields of custom TSV/CSV input files, for example:
+
+### Samplesheet
+
+A TSV file with fields:
+
+| Field                   | Type                  | Description                                                     |
+|-------------------------|-----------------------|-----------------------------------------------------------------|
+| `sample_id`             | String, no whitespace | Sample identifier                                               |
+| `condition`             | String, no whitespace | Abbreviated name for experimental condition, e.g. "neg_control" |
+| `condition_description` | String                | Long description of experimental condition                      |
+| `fastq_r1`              | Absolute path         | Path to R1 FASTQ for sample                                     |
+| `fastq_r2`              | Absolute path         | Path to R2 FASTQ for sample                                     |
+
+## Outputs
+
+> [!WARNING]
+> **After creating a new project, describe workflow outputs here.**
+>
+> Consider using `tree`-style output to describe the expected output file structure, linking to third-party file format descriptions, and using tables as in [Inputs](#inputs) for custom output file formats.
+
+```console
+results
+├── plots
+│   └── {experiment}.heatmap.png         # Heatmap describing counts for all samples
+└── counts
+    ├── {sample_id}.counts.tsv           # Raw counts
+    └── {sample_id}.counts.summary.tsv   # Summary of counts
+```
 
 ## Set up Environment
 
@@ -12,3 +57,17 @@ To install and activate:
 mamba env create -f environment.yml
 mamba activate {{cookiecutter.project_slug}}
 ```
+
+## Configure and run the workflow
+
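+The [workflow configuration schema](config/config_schema.yml) describes the parameters for the workflow, and the [config file](config/config.yml) contains the parameter values. To run the workflow with the values from the config file, with `-j12` allowing up to 12 parallel jobs: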
+
+```console
+snakemake -j12
+```
+
+To override individual values from the config file, pass `key=value` pairs on the command line with the `--config` option.
+
+```console
+snakemake -j12 --config experiment=myexperiment
+```
diff --git a/{{cookiecutter.project_slug}}/Snakefile b/{{cookiecutter.project_slug}}/Snakefile
@@ -0,0 +1,101 @@
+################################################################################
+# Pipeline for {{cookiecutter.project_slug}}
+################################################################################
+
+from snakemake.utils import validate
+
+
+################################################################################
+# Configuration loading and validation
+################################################################################
+
+
+configfile: workflow.basedir + "/config/config.yml"
+
+
+validate(config, workflow.basedir + "/config/config_schema.yml")
+
+
+################################################################################
+# Workflow constants and Snakemake rules
+################################################################################
+
+EXPERIMENT = config["experiment"]
+SAMPLES = config["samples"]
+REFERENCE_URL = config["reference"]
+READS = ["1", "2"]
+BWA_INDEX_EXTS = [".amb", ".ann", ".bwt", ".pac", ".sa"]
+
+
+rule all:
+    input:
+        multiext("data/resources/ref.fa", *BWA_INDEX_EXTS),
+        expand("data/raw/{EXPERIMENT}/{sample}_R{read}.fastq.gz", EXPERIMENT=EXPERIMENT, sample=SAMPLES, read=READS),
+
+
+rule download_raw_data:
+    """
+    Downloads the raw reads for each sample.
+
+    Output:
+        reads: A gzip-compressed FASTQ file.
+    """
+    output:
+        reads="data/raw/{EXPERIMENT}/{sample}_R{read}.fastq.gz",
+    log:
+        "logs/download_raw_data.{EXPERIMENT}.{sample}.R{read}.log",
+    shell:
+        """
+        (
+            # wget -O {output.reads} https://to/data/{wildcards.sample}_R{wildcards.read}.fastq.gz
+            touch {output.reads}
+        ) &> {log}
+        """
+
+
+rule index_reference_genome:
+    """
+    Runs bwa indexing on the reference genome.
+
+    Input:
+        ref: The reference genome in FASTA format.
+
+    Output:
+        indexes: The BWA index files for the reference genome.
+    """
+    input:
+        ref="data/resources/ref.fa",
+    output:
+        indexes=multiext("data/resources/ref.fa", *BWA_INDEX_EXTS),
+    log:
+        "logs/index_reference_genome.log",
+    shell:
+        """
+        (
+            # bwa index {input.ref}
+            touch {output.indexes}
+        ) &> {log}
+        """
+
+
+rule download_reference_genome:
+    """
+    Downloads the reference genome FASTA file.
+
+    Output:
+        ref: The reference genome in FASTA format.
+    """
+    params:
+        ref_url=REFERENCE_URL,
+    output:
+        ref="data/resources/ref.fa",
+    log:
+        "logs/download_reference_genome.log",
+    shell:
+        """
+        (
+            # wget -O {output.ref}.gz {params.ref_url}
+            # gunzip {output.ref}.gz
+            touch {output.ref}
+        ) &> {log}
+        """
diff --git a/{{cookiecutter.project_slug}}/config/config.yml b/{{cookiecutter.project_slug}}/config/config.yml
@@ -0,0 +1,7 @@
+################################################################################
+# Configuration file for {{cookiecutter.project_slug}}
+################################################################################
+
+experiment: "experiment1"
+samples: ["sample1", "sample2"]
+reference: https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
diff --git a/{{cookiecutter.project_slug}}/config/config_schema.yml b/{{cookiecutter.project_slug}}/config/config_schema.yml
@@ -0,0 +1,30 @@
+$schema: "https://json-schema.org/draft/2020-12/schema"
+description: Config schema for {{cookiecutter.project_slug}}
+
+type: object
+
+properties:
+  experiment:
+    type: string
+    description: Name of the experiment.
+    examples: ["{{cookiecutter.project_slug}}"]
+
+  reference:
+    type: string
+    description: URL to the reference genome FASTA.
+    examples: ["https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz"]
+
+  samples:
+    type: array
+    description: List of sample identifiers.
+    examples: [["sample1", "sample2"]]
+
+  p_value_cutoff:
+    type: number
+    description: P-value cutoff for statistical significance.
+    default: 0.05
+
+required:
+  - experiment
+  - samples
+  - reference
diff --git a/{{cookiecutter.project_slug}}/workflow/{{cookiecutter.project_slug}}.smk b/{{cookiecutter.project_slug}}/workflow/{{cookiecutter.project_slug}}.smk