
feat: workflow and config templates #29


Merged
merged 24 commits into from
Mar 14, 2025
24 commits
6e5e4d1
feat: template workflow and config/config_schema
ameynert Jan 30, 2025
c84ef78
feat: added development section to README
ameynert Jan 30, 2025
412660b
feat: moved snakefile and config up a level, added config schema doc
ameynert Jan 31, 2025
10a0725
feat: tweak config validation description
ameynert Jan 31, 2025
1c5896c
feat: added input and output description suggestions to README
ameynert Jan 31, 2025
f491ad6
fix: extra character in code block
ameynert Jan 31, 2025
aa04242
feat: expand recommendations
ameynert Feb 5, 2025
8563056
fix: typo
ameynert Feb 5, 2025
6d87fa8
feat: expanded example Snakefile, use docstrings on rules
ameynert Feb 5, 2025
09da155
chore: add error_summary.txt to .gitignore
ameynert Mar 10, 2025
b1f572b
feat: move guidelines to Notion page
ameynert Mar 10, 2025
dc42a5a
feat: use experiment in snakefile
ameynert Mar 10, 2025
143663d
feat: add reference to config
ameynert Mar 10, 2025
a9e1e09
feat: warning for input description in README
ameynert Mar 10, 2025
8be33e2
feat: update README with suggestions
ameynert Mar 10, 2025
0c5b872
feat: use configfile directive
ameynert Mar 10, 2025
155e171
feat: ran snakefmt linting
ameynert Mar 10, 2025
00e10c7
feat: simplified instructions
ameynert Mar 10, 2025
ae8e224
feat: remove error_summary.txt, prematurely added
ameynert Mar 10, 2025
5f55b34
feat: remove fgsmk function
ameynert Mar 10, 2025
daf3875
fix: README links
ameynert Mar 14, 2025
38aaa91
feat: table code-formatting for field names
ameynert Mar 14, 2025
82d7467
feat: clarified warning block
ameynert Mar 14, 2025
6b2f0e6
feat: add input descriptions to README
ameynert Mar 14, 2025
21 changes: 17 additions & 4 deletions README.md
@@ -21,17 +21,30 @@ Install [`cookiecutter`](https://cookiecutter.readthedocs.io/en/stable/) and run
You will have a Git-initialized Snakemake project at the following location:

```console
> tree -a myworkflow
❯ tree -a myworkflow
myworkflow
├── .git # contents omitted for brevity
├── .github
│   └── CODEOWNERS
├── .gitignore
├── README.md
├── Snakefile
├── config
│   ├── config.yml
│   └── config_schema.yml
├── environment.yml
└── workflow
    └── myworkflow.smk
└── environment.yml
```

## Development

Read the [Snakemake Best Practices](https://snakemake.readthedocs.io/en/stable/snakefiles/best_practices.html) and the [Fulcrum Snakemake](https://www.notion.so/fulcrumgenomics/Snakemake-3d836708c9bc47ca868ee9a09ada7d0d) documentation.

### Configuration

[Snakemake supports configuration and validation of workflow parameters](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html).

To ensure valid inputs to your workflow, use a [configuration schema](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#validation).

This template includes an example [configuration schema]({{cookiecutter.project_slug}}/config/config_schema.yml) and [configuration file]({{cookiecutter.project_slug}}/config/config.yml) to get you started.

At runtime, the [workflow]({{cookiecutter.project_slug}}/Snakefile) [validates the provided configuration](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#validation) against the defined JSON schema.
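
As a rough illustration of what schema validation guards against, the sketch below hand-rolls a check for required keys, basic types, and defaults using only the Python standard library. It is a simplified stand-in and not how `snakemake.utils.validate` works internally (that relies on full JSON Schema validation). The `SCHEMA` dictionary and the `check_config` helper are hypothetical names that mirror the example schema in this template.

```python
# Simplified, hand-rolled stand-in for JSON Schema validation (illustration only).
# SCHEMA and check_config are hypothetical; they mirror the template's example schema.
SCHEMA = {
    "properties": {
        "experiment": {"type": str},
        "reference": {"type": str},
        "samples": {"type": list},
        "p_value_cutoff": {"type": (int, float), "default": 0.05},
    },
    "required": ["experiment", "samples", "reference"],
}


def check_config(config: dict, schema: dict = SCHEMA) -> dict:
    """Return a copy of `config` with defaults filled in, or raise ValueError."""
    missing = [key for key in schema["required"] if key not in config]
    if missing:
        raise ValueError(f"missing required keys: {missing}")

    checked = dict(config)
    for key, spec in schema["properties"].items():
        if key not in checked:
            # Optional key: fall back to the schema default when one is declared.
            if "default" in spec:
                checked[key] = spec["default"]
            continue
        if not isinstance(checked[key], spec["type"]):
            found = type(checked[key]).__name__
            raise ValueError(f"{key!r} has the wrong type: {found}")
    return checked


config = check_config(
    {
        "experiment": "experiment1",
        "samples": ["sample1", "sample2"],
        "reference": "https://example.org/ref.fa.gz",  # placeholder URL
    }
)
print(config["p_value_cutoff"])
```

Failing early on a missing or mistyped key is the main benefit: the workflow stops before any rule runs rather than partway through a long job.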
61 changes: 60 additions & 1 deletion {{cookiecutter.project_slug}}/README.md
@@ -1,4 +1,49 @@
# Snakemake workflow with Python toolkit
# {{cookiecutter.project_slug}}

{{cookiecutter.project_short_description}}

## Inputs

> [!WARNING]
> **After creating a new project, describe workflow input file formats here.**
>
> Use the [configuration schema](config/config_schema.yml) for simple input descriptions. Include URLs to input descriptions for third-party tools.

- `reference`: URL to the reference genome FASTA to be downloaded.
- `experiment`: Name of the experiment.
- `samples`: List of sample identifiers.
- `p_value_cutoff`: P-value cut-off for statistical significance.

> [!WARNING]
> Consider using [Markdown tables](https://www.tablesgenerator.com/markdown_tables) to describe the fields of custom TSV/CSV input files, e.g.

### Samplesheet

A TSV file with fields:

| Field | Type | Description |
|-------------------------|-----------------------|-----------------------------------------------------------------|
| `sample_id` | String, no whitespace | Sample identifier |
| `condition` | String, no whitespace | Abbreviated name for experimental condition, e.g. "neg_control" |
| `condition_description` | String | Long description of experimental condition |
| `fastq_r1` | Absolute path | Path to R1 FASTQ for sample |
| `fastq_r2` | Absolute path | Path to R2 FASTQ for sample |
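
A samplesheet with these fields could be read and checked along the following lines. This is a minimal sketch using only the standard library; the `read_samplesheet` helper is hypothetical, the column names come from the table above, and the absolute-path check assumes POSIX-style paths.

```python
import csv
import io

# Column names are taken from the samplesheet table; the checks are assumptions
# derived from the "Type" column, not part of the template itself.
FIELDS = ["sample_id", "condition", "condition_description", "fastq_r1", "fastq_r2"]
NO_WHITESPACE = ("sample_id", "condition")
ABSOLUTE_PATHS = ("fastq_r1", "fastq_r2")


def read_samplesheet(handle) -> list[dict]:
    """Parse a tab-separated samplesheet and check the documented constraints."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [name for name in FIELDS if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing columns: {missing}")

    rows = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for name in NO_WHITESPACE:
            value = row[name] or ""
            if not value or any(ch.isspace() for ch in value):
                raise ValueError(f"line {line_number}: {name} must be non-empty with no whitespace")
        for name in ABSOLUTE_PATHS:
            value = row[name] or ""
            if not value.startswith("/"):  # POSIX-style absolute path assumed
                raise ValueError(f"line {line_number}: {name} must be an absolute path")
        rows.append(row)
    return rows


example = io.StringIO(
    "sample_id\tcondition\tcondition_description\tfastq_r1\tfastq_r2\n"
    "sample1\tneg_control\tNegative control\t/data/sample1_R1.fastq.gz\t/data/sample1_R2.fastq.gz\n"
)
rows = read_samplesheet(example)
print(rows[0]["sample_id"], rows[0]["condition"])
```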

## Outputs

> [!WARNING]
> **After creating a new project, describe workflow outputs here.**
>
> Consider using `tree`-style output to describe the expected output file structure, URLs to third-party file format descriptions, and tables as in [Inputs](#inputs) for custom output file formats.

```console
results
├── plots
│   └── {experiment}.heatmap.png # Heatmap describing counts for all samples
└── counts
   ├── {sample_name}.counts.tsv # Raw counts
   └── {sample_name}.counts.summary.tsv # Summary of counts
```
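
The layout above can also be enumerated programmatically, which is handy when checking that a run produced everything it should. The sketch below is illustrative only: `expected_outputs` is a hypothetical helper, and the file names are copied from the example tree rather than from any real rule.

```python
def expected_outputs(experiment: str, samples: list[str], root: str = "results") -> list[str]:
    """List the files the example output tree describes for one experiment."""
    paths = [f"{root}/plots/{experiment}.heatmap.png"]
    for sample in samples:
        paths.append(f"{root}/counts/{sample}.counts.tsv")
        paths.append(f"{root}/counts/{sample}.counts.summary.tsv")
    return paths


for path in expected_outputs("experiment1", ["sample1", "sample2"]):
    print(path)
```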

## Set up Environment

@@ -12,3 +57,17 @@ To install and activate:
mamba env create -f environment.yml
mamba activate {{cookiecutter.project_slug}}
```

## Configure and run the workflow

The [workflow configuration schema](config/config_schema.yml) describes the parameters for the workflow, and the [config file](config/config.yml) contains the parameter values.

```console
snakemake -j12
```

You can override specific values on the command line with the `--config` parameter.

```console
snakemake -j12 --config experiment=myexperiment
```
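
The precedence is simple: a value given with `--config` replaces the value of the same key from the config file. The sketch below shows that rule for top-level keys only. `apply_overrides` is a hypothetical helper, and it keeps every value as a string, whereas Snakemake's own parsing also handles type conversion and nested values.

```python
def apply_overrides(file_config: dict, overrides: list[str]) -> dict:
    """Merge KEY=VALUE overrides over values loaded from the config file."""
    merged = dict(file_config)
    for item in overrides:
        key, separator, value = item.partition("=")
        if not separator or not key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        merged[key] = value  # the command-line value wins
    return merged


file_config = {"experiment": "experiment1", "samples": ["sample1", "sample2"]}
merged = apply_overrides(file_config, ["experiment=myexperiment"])
print(merged["experiment"])
```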
101 changes: 101 additions & 0 deletions {{cookiecutter.project_slug}}/Snakefile
@@ -0,0 +1,101 @@
################################################################################
# Pipeline for {{cookiecutter.project_slug}}
################################################################################

from snakemake.utils import validate


################################################################################
# Utility methods and variables
################################################################################


configfile: workflow.basedir + "/config/config.yml"


validate(config, workflow.basedir + "/config/config_schema.yml")


################################################################################
# Snakemake rules
################################################################################

EXPERIMENT = config["experiment"]
SAMPLES = config["samples"]
REFERENCE_URL = config["reference"]
READS = ["1", "2"]
BWA_INDEX_EXTS = [".amb", ".ann", ".bwt", ".pac", ".sa"]


rule all:
    input:
        multiext("data/resources/ref.fa", *BWA_INDEX_EXTS),
        # Bind every placeholder in the pattern: expand() raises a WildcardError
        # for any placeholder that is left without a value.
        expand(
            "data/raw/{EXPERIMENT}/{sample}_R{read}.fastq.gz",
            EXPERIMENT=EXPERIMENT,
            sample=SAMPLES,
            read=READS,
        ),


rule download_raw_data:
    """
    Downloads the raw reads for each sample.

    Output:
        reads: A gzip-compressed FASTQ file.
    """
    output:
        reads="data/raw/{EXPERIMENT}/{sample}_R{read}.fastq.gz",
    log:
        # Output and log files of a rule must contain the same wildcards.
        "logs/download_raw_data.{EXPERIMENT}.{sample}.R{read}.log",
    shell:
        """
        (
        # wget -O {output.reads} https://to/data/{wildcards.sample}_R{wildcards.read}.fastq.gz
        touch {output.reads}
        ) &> {log}
        """


rule index_reference_genome:
    """
    Runs bwa indexing on the reference genome.

    Input:
        ref: The reference genome in FASTA format.

    Output:
        indexes: The BWA index files for the reference genome.
    """
    input:
        ref="data/resources/ref.fa",
    output:
        indexes=multiext("data/resources/ref.fa", *BWA_INDEX_EXTS),
    log:
        "logs/index_reference_genome.log",
    shell:
        """
        (
        # bwa index {input.ref}
        touch {output.indexes}
        ) &> {log}
        """


rule download_reference_genome:
    """
    Downloads the reference genome FASTA file.

    Output:
        ref: The reference genome in FASTA format.
    """
    params:
        ref_url=REFERENCE_URL,
    output:
        ref="data/resources/ref.fa",
    log:
        "logs/download_reference_genome.log",
    shell:
        """
        (
        # wget -O {output.ref}.gz {params.ref_url}
        # gunzip {output.ref}.gz
        touch {output.ref}
        ) &> {log}
        """
7 changes: 7 additions & 0 deletions {{cookiecutter.project_slug}}/config/config.yml
@@ -0,0 +1,7 @@
################################################################################
# Configuration file for {{cookiecutter.project_slug}}
################################################################################

experiment: "experiment1"
samples: ["sample1", "sample2"]
reference: https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
30 changes: 30 additions & 0 deletions {{cookiecutter.project_slug}}/config/config_schema.yml
@@ -0,0 +1,30 @@
$schema: "https://json-schema.org/draft/2020-12/schema"
description: Config schema for {{cookiecutter.project_slug}}

type: object

properties:
  experiment:
    type: string
    description: Name of the experiment.
    example: "{{cookiecutter.project_slug}}"

  reference:
    type: string
    description: URL to the reference genome FASTA.
    example: https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

  samples:
    type: array
    description: List of sample identifiers.
    example: ["sample1", "sample2"]

  p_value_cutoff:
    type: number
    description: P-value cutoff for statistical significance.
    default: 0.05

required:
  - experiment
  - samples
  - reference
Empty file.