feat: workflow and config templates #29

Open
wants to merge 9 commits into
base: am_feat_template_update
70 changes: 67 additions & 3 deletions README.md
@@ -28,10 +28,74 @@ myworkflow
│   └── CODEOWNERS
├── .gitignore
├── README.md
├── Snakefile
├── config
│   ├── config.yml
│   └── config_schema.yml
├── environment.yml
└── workflow
    └── myworkflow.smk
└── environment.yml
```

## Development

Read the [Snakemake Best Practices](https://snakemake.readthedocs.io/en/stable/snakefiles/best_practices.html) and the [Fulcrum Snakemake](https://www.notion.so/fulcrumgenomics/Snakemake-3d836708c9bc47ca868ee9a09ada7d0d) documentation. The text below is adapted from the latter.

@ameynert (author) commented on Jan 31, 2025:

Maybe this text should be used to update the Notion page instead? The template is only intended for internal use.

Reviewer comment:

I'd be fine moving most/all of the below text to the Notion page and linking from here. (Just use suggest changes so N/T/C can review.) That way, we're not maintaining it in two locations.

And I find it easier to make edits and suggestions to longform text in Notion or gdocs rather than a PR

If there is a single Snakemake workflow, it should be named `Snakefile` and kept at the top level of the repository. If there are multiple workflows, name them according to their function and give them the extension `.smk`.

Workflow files should mainly contain `rules`. Any additional code should be added to a separate `Python` toolkit, for example to parse the configuration object, handle samplesheet input, or to organize reference data.

Reviewer comment:

Suggested change
- Workflow files should mainly contain `rules`. Any additional code should be added to a separate `Python` toolkit, for example to parse the configuration object, handle samplesheet input, or to organize reference data.
+ Workflow files should mainly contain `rules`. Any additional code should be added to a separate Python toolkit, for example to parse the configuration object, handle samplesheet input, or to organize reference data.


Requirements (e.g. executables) should not be specified in the workflows, but should be maintained as part of an environment (e.g. via [Mamba](https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html)).

Follow these guidelines:

1. All rules should have descriptive names

Reviewer comment:

Suggested change
- 1. All rules should have descriptive names
+ 1. All rules should have descriptive names.

2. All rules should have a short docstring describing what the rule does, and what tool(s) it uses. If there are any custom input or output file formats, describe them, e.g. for a CSV/TSV file list the expected field names.
3. All rules should have the following directives:
   - `input`: the input paths
   - `output`: the output paths
   - `log`: the path to the log file. Good practice is to include the workflow name, rule name, and any wildcards in the log file name, e.g. `logs/{workflow_name}.{rule_name}.{wildcard1}.{wildcard2}.log`.
4. The following directives are optional, but recommended when known:
   - `params`: any custom metadata, both static and conditional on the wildcards
   - `threads`: the number of threads to use
   - `resources`: specifies custom resources with the following keywords:
     - `mem_gb`: the amount of memory to allocate in gigabytes (e.g. `8` for eight gigabytes)
5. The parameters to the `input`, `params`, and `output` directives should have keywords

Reviewer comment:

Suggested change
- 5. The parameters to the `input`, `params`, and `output` directives should have keywords
+ 5. Inputs, outputs, and parameters should always be provided with keywords, and should always be referenced within the `shell` block by keyword (never positionally).


```python
rule:
    input:
        bam='/path/to/bam'
```

6. Shell commands should not contain references to global variables, but only rule directives. Data needed to build the command should be stored in the `params` data structure.
7. Both standard output and standard error should be redirected to the log file.

```python
rule:
    ...
    shell:
        """
        (
            echo "Hello Zorld" | sed -e 's_Z_W_'
        ) &> {log}
        """
```

8. Use `ALL_CAPS` for naming global variables. This will help distinguish them from local variables and parameters when used in input functions.
9. Simplify inputs with the use of [Snakemake helper functions](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#helpers-for-defining-rules) e.g. `expand` and `multiext`.
10. Use [input functions](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#input-functions) where appropriate.
11. Be consistent with your file separator character.
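
Guidelines 3, 8, and 10 can be illustrated in plain Python, outside of any rule. The sketch below is only an illustration and is not part of the template: `SAMPLE_TO_BAM`, `log_pattern`, and `bam_for_sample` are hypothetical names, the paths are made up, and `SimpleNamespace` stands in for the `wildcards` object that Snakemake passes to an input function.

```python
from types import SimpleNamespace

# Global variable named in ALL_CAPS (guideline 8). Hypothetical sample-to-file mapping.
SAMPLE_TO_BAM = {
    "sample1": "data/bam/sample1.bam",
    "sample2": "data/bam/sample2.bam",
}


def log_pattern(workflow_name: str, rule_name: str, *wildcard_names: str) -> str:
    """Build a log path pattern from the workflow name, rule name, and wildcard names."""
    # Each wildcard name is wrapped in braces so Snakemake can fill it in later.
    fields = [workflow_name, rule_name] + [f"{{{name}}}" for name in wildcard_names]
    return "logs/" + ".".join(fields) + ".log"


def bam_for_sample(wildcards) -> str:
    """Input function (guideline 10): look up the BAM path for the requested sample."""
    return SAMPLE_TO_BAM[wildcards.sample]


print(log_pattern("myworkflow", "align", "sample"))
print(bam_for_sample(SimpleNamespace(sample="sample1")))
```

Inside a rule, such a function would typically be passed by name to a keyword in the `input` directive, so the lookup logic stays out of the `shell` block.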

### Configuration

[Snakemake configuration documentation](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html).

Reviewer comment:

Suggested change
- [Snakemake configuration documentation](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html).
+ [Snakemake supports configuration and validation of workflow parameters](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html).


To ensure valid inputs to your workflow, use a [configuration schema](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#validation).

This template includes an example [configuration schema](config/config_schema.yml) and [configuration file](config/config.yml) to get you started.
The [workflow](Snakefile) looks for the configuration schema to validate the configuration file:

```python
from snakemake.utils import validate

validate(config, workflow.basedir + "/config/config_schema.yml")
```
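
As a rough illustration of what validation provides, the standard-library sketch below mimics two behaviours for the example schema in this template: rejecting a configuration that lacks a required key, and filling in a declared default. It is not how `snakemake.utils.validate` is implemented. `SCHEMA` and `check_config` are hypothetical names, and only the `required` and `default` keywords are handled.

```python
# Hypothetical in-memory copy of the example schema (only the parts used here).
SCHEMA = {
    "properties": {
        "experiment": {"type": "string"},
        "samples": {"type": "array"},
        "p_value_cutoff": {"type": "number", "default": 0.05},
    },
    "required": ["experiment", "samples"],
}


def check_config(config: dict, schema: dict) -> dict:
    """Return a copy of config with defaults filled in; raise if a required key is missing."""
    missing = [key for key in schema["required"] if key not in config]
    if missing:
        raise ValueError(f"Missing required config keys: {missing}")
    checked = dict(config)
    for key, spec in schema["properties"].items():
        if "default" in spec:
            # Only fill the default when the user did not supply a value.
            checked.setdefault(key, spec["default"])
    return checked


config = check_config({"experiment": "experiment1", "samples": ["sample1", "sample2"]}, SCHEMA)
print(config["p_value_cutoff"])
```

A real schema validator also checks types and nested structures, which this sketch deliberately leaves out.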
17 changes: 17 additions & 0 deletions {{cookiecutter.project_slug}}/README.md
@@ -12,3 +12,20 @@ To install and activate:
mamba env create -f environment.yml
mamba activate {{cookiecutter.project_slug}}
```

## Configure and run the workflow

The [workflow configuration schema](config/config_schema.yml) describes the parameters for the workflow.
To set the parameters for a specific run of the workflow, write a [configuration file](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html) using the schema and the [example config/config.yml](config/config.yml) as a guide.

If you do not specify a workflow file with e.g. `-s myworkflow.smk`, Snakemake will look, in this order, for a file called `Snakefile`, `snakefile`, `workflow/Snakefile`, or `workflow/snakefile`.

Reviewer comment:

How much of the Snakemake basics do we think are necessary to include in this template?

Some of this feels more appropriate as links to the relevant sections of the Snakemake docs, or assumed as prior knowledge for the user


```console
snakemake -j12 --configfile config/config.yml
```

You can override specific values on the command line with the `--config` parameter.

```console
snakemake -j12 --configfile config/config.yml --config experiment=myexperiment
```
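
The precedence can be pictured as a dictionary update in which command-line values win over values from the configuration file. The sketch below is an illustration under simplifying assumptions, not Snakemake's own code: `apply_overrides` is a hypothetical helper, it handles only top-level `KEY=VALUE` pairs, and it keeps every override as a plain string.

```python
def apply_overrides(file_config: dict, overrides: list[str]) -> dict:
    """Return a copy of file_config updated with KEY=VALUE pairs; overrides win."""
    merged = dict(file_config)
    for item in overrides:
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = item.partition("=")
        merged[key] = value
    return merged


file_config = {"experiment": "experiment1", "samples": ["sample1", "sample2"]}
merged = apply_overrides(file_config, ["experiment=myexperiment"])
print(merged["experiment"])
```

Keys that are not overridden, such as `samples` here, keep the value from the configuration file.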
50 changes: 50 additions & 0 deletions {{cookiecutter.project_slug}}/Snakefile
@@ -0,0 +1,50 @@
################################################################################
# Pipeline for {{cookiecutter.project_slug}}
################################################################################

from pathlib import Path

from fgsmk.log import on_error
from snakemake.utils import validate


################################################################################
# Utility methods and variables
################################################################################

validate(config, workflow.basedir + "/config/config_schema.yml")

onerror:
    """Block of code that gets called if the snakemake pipeline exits with an error."""
    on_error(snakefile=Path(__file__), config=config, log=Path(log))

################################################################################
# Snakemake rules
################################################################################

SAMPLES = config["samples"]
READS = ["1", "2"]


rule all:
    input:
        "data/resources/ref.fa",
        expand("data/raw/{sample}_R{read}.fastq.gz", sample=SAMPLES, read=READS),


rule download_raw_data:
    output:
        "data/raw/{sample}_R{read}.fastq.gz",
    shell:
        """
        # wget -O {output} https://to/data/{wildcards.sample}_R{wildcards.read}.fastq.gz
        touch {output}
        """

Reviewer comment:

Suggested change


rule download_resource_data:
    output:
        "data/resources/ref.fa",
    shell:
        """
        # wget -O {output} https://to/data/ref.fa
        touch {output}
        """
6 changes: 6 additions & 0 deletions {{cookiecutter.project_slug}}/config/config.yml
@@ -0,0 +1,6 @@
################################################################################
# Configuration file for {{cookiecutter.project_slug}}
################################################################################

experiment: "experiment1"
samples: ["sample1", "sample2"]
24 changes: 24 additions & 0 deletions {{cookiecutter.project_slug}}/config/config_schema.yml
@@ -0,0 +1,24 @@
$schema: "https://json-schema.org/draft/2020-12/schema"
description: Config schema for {{cookiecutter.project_slug}}

type: object

properties:
  experiment:
    type: string
    description: Name of the experiment.
    example: "{{cookiecutter.project_slug}}"

  samples:
    type: array
    description: List of samples.
    example: ["sample1", "sample2"]

  p_value_cutoff:
    type: number
    description: P-value cutoff for statistical significance.
    default: 0.05

required:
  - experiment
  - samples

Reviewer comment on lines +6 to +24:

Do you think we should update the workflow to use these config params? Or, choose contrived example params that could be used in the workflow?
