
Commit 6282d2b

Merge pull request #183 from cgat-developers/AC-docs
Ac docs
2 parents 6328b14 + 76daa29 commit 6282d2b


64 files changed (+3077, −1 lines)

.github/workflows/cgatcore_python.yml

Lines changed: 21 additions & 0 deletions
@@ -53,3 +53,24 @@ jobs:
```yaml
        run: |
          pip install .
          ./all-tests.sh

  deploy_docs:
    name: Deploy MkDocs Documentation
    runs-on: ubuntu-latest
    needs: build
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.x'

      - name: Install MkDocs and Dependencies
        run: |
          pip install mkdocs mkdocs-material

      - name: Build and Deploy MkDocs Site
        run: mkdocs gh-deploy --force --clean
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
```diff
 MIT License

-Copyright (c) 2019 cgat-developers
+Copyright (c) 2024 cgat-developers

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
```
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
# Setting run parameters

Our workflows are executed using default settings that specify parameters for requirements such as memory, threads and environment. Each of these parameters can be modified within the pipeline as needed.

## Modifiable run parameters

- **`job_memory`**: Memory to allocate for the task. Default: `"4G"`.
- **`job_total_memory`**: Total memory to use for a job.
- **`to_cluster`**: Send the job to the cluster. Default: `True`.
- **`without_cluster`**: Run the job locally when set to `True`. Default: `False`.
- **`cluster_memory_ulimit`**: Restrict virtual memory. Default: `False`.
- **`job_condaenv`**: Name of the conda environment to use for each job. Default: the environment specified in `bashrc`.
- **`job_array`**: If set, run the statement as an array job; `job_array` should be a tuple with start, end, and increment values (see the sketch after this list). Default: `False`.
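As a rough illustration of the `job_array` tuple format described above (this sketch is not part of the committed file, and the task name is made up), an array job could be requested like this:

```python
@transform('*.unsorted', suffix('.unsorted'), '.sorted')
def sortFileAsArray(infile, outfile):
    statement = '''sort -T %(tmpdir)s %(infile)s > %(outfile)s'''
    # (start, end, increment): here ten array tasks, stepping by one
    P.run(statement, job_array=(0, 10, 1))
```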
## Specifying parameters to a job

Parameters can be set within a pipeline task as follows:

```python
@transform('*.unsorted', suffix('.unsorted'), '.sorted')
def sortFile(infile, outfile):
    statement = '''sort -T %(tmpdir)s %(infile)s > %(outfile)s'''
    P.run(statement,
          job_condaenv="sort_environment",
          job_memory="30G",
          job_threads=2,
          without_cluster=False,
          job_total_memory="50G")
```
31+
In this example, the `sortFile` function sorts an unsorted file and saves it as a new sorted file. The `P.run()` statement is used to specify various parameters:
32+
33+
- `job_condaenv="sort_environment"`: This specifies that the task should use the `sort_environment` conda environment.
34+
- `job_memory="30G"`: This sets the memory requirement for the task to 30GB.
35+
- `job_threads=2`: The task will use 2 threads.
36+
- `without_cluster=False`: This ensures the job is sent to the cluster.
37+
- `job_total_memory="50G"`: The total memory allocated for the job is 50GB.
38+
39+
These parameters allow fine-tuning of job execution to fit specific computational requirements, such as allocating more memory or running on a local machine rather than a cluster.
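For the local-execution case mentioned above, a minimal sketch (not part of the committed file) would simply flip the cluster flag:

```python
@transform('*.unsorted', suffix('.unsorted'), '.sorted')
def sortFileLocally(infile, outfile):
    statement = '''sort -T %(tmpdir)s %(infile)s > %(outfile)s'''
    # Run on the local machine rather than submitting to the cluster
    P.run(statement, without_cluster=True)
```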

docs/defining_workflow/tutorial.md

Whitespace-only changes.
Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@
# Writing a workflow

## Our workflow philosophy

The explicit aim of CGAT-core is to allow users to quickly and easily build their own computational pipelines, speeding up their analysis workflow.

When building pipelines, it is often useful to keep in mind the following guiding principles:

### Flexibility

There are always new tools and insights that could be incorporated into a pipeline. Ultimately, a pipeline should be flexible, and the code should not constrain you when implementing new features.

### Scriptability

The pipeline should be scriptable, i.e., the entire pipeline can be run within another pipeline. Similarly, parts of a pipeline should be easily duplicated to process several data streams in parallel. This is crucial in genome studies, as a single analysis will not always permit making inferences by itself. When writing a pipeline, we typically create a command line script (included in the CGAT-apps repository) and then run this script as a command line statement in the pipeline.

### Reproducibility

The pipeline should be fully automated so that the same inputs and configuration produce the same outputs.

### Reusability

The pipeline should be reusable on similar data, preferably requiring only changes to a configuration file (such as `pipeline.yml`).

### Archivability

Once finished, the whole project should be archivable without relying heavily on external data. This process should be simple; all project data should be self-contained, without needing to go through various directories or databases to determine dependencies.
## Building a pipeline

The best way to build a pipeline is to start from an example. The [CGAT Showcase](https://cgat-showcase.readthedocs.io/en/latest/index.html) contains a toy example of an RNA-seq analysis pipeline, demonstrating how simple workflows can be generated with minimal code. For more complex workflows, you can refer to [CGAT-Flow](https://github.com/cgat-developers/cgat-flow).

For a step-by-step tutorial on running pipelines, refer to our [Getting Started Tutorial](#).

To construct a pipeline from scratch, continue reading below.

In an empty directory, create a directory named after your pipeline and a Python task file that carries the pipeline name. For example:

```bash
mkdir test && touch pipeline_test.py
```

All pipelines require a `.yml` configuration file that allows you to modify the behaviour of your code. This file is named `pipeline.yml` and is placed in the pipeline directory (here `test/`):

```bash
touch test/pipeline.yml
```

To facilitate debugging and reading, our pipelines are designed so that the pipeline task file contains the Ruffus tasks, while the code to transform and analyse data lives in an associated module file.

If you wish to create a module file, it is conventionally named using the format `ModuleTest.py`. You can import it into the main pipeline task file (`pipeline_test.py`) as follows:

```python
import ModuleTest
```

The [pipeline module](https://github.com/cgat-developers/cgat-core/tree/master/cgatcore/pipeline) in CGAT-core provides many useful functions for pipeline construction.
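As a minimal sketch of how these pieces are usually tied together (this block is not part of the committed file, and it assumes the conventional `P.get_parameters()` and `P.main()` entry points), a `pipeline_test.py` might look like:

```python
"""pipeline_test.py -- illustrative skeleton only."""
import sys
from ruffus import transform, suffix
from cgatcore import pipeline as P

# Load configuration from pipeline.yml (plus any user overrides)
PARAMS = P.get_parameters(["pipeline.yml"])


@transform('*.txt', suffix('.txt'), '.counts')
def countLines(infile, outfile):
    # A trivial task: count lines in each input file
    statement = '''wc -l < %(infile)s > %(outfile)s'''
    P.run(statement)


if __name__ == "__main__":
    # Hand control to the cgatcore pipeline command line interface
    sys.exit(P.main(sys.argv))
```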
## Pipeline input

Pipelines are executed within a dedicated working directory, which usually contains:

- A pipeline configuration file: `pipeline.yml`
- Input data files, typically specified in the pipeline documentation

Other files that might be used include external data files, such as genomes, referred to by their full path.

Pipelines work with input files in the working directory, usually identified by their suffix. For instance, a mapping pipeline might look for any `.fastq.gz` files in the directory, perform QC on them, and map the reads to a genome sequence.
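For illustration only (not part of the committed file; `my_qc_tool` is a placeholder command), such a pipeline might pick up its inputs like this:

```python
@transform('*.fastq.gz', suffix('.fastq.gz'), '.qc.tsv')
def runQC(infile, outfile):
    # Every .fastq.gz in the working directory becomes one QC task
    statement = '''zcat %(infile)s | my_qc_tool > %(outfile)s'''
    P.run(statement)
```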
## Pipeline output

The pipeline will generate files and database tables in the working directory. You can structure your files and directories in any way that fits your needs: some prefer a flat structure with many files, while others use deeper hierarchies.

To save disk space, compressed files should be used wherever possible. Most data files compress well; for example, `fastq` files often compress by up to 80%. Working with compressed files is straightforward using Unix pipes (`gzip`, `gunzip`, `zcat`).

If you need random access to a file, load it into a database and index it appropriately. Genomic interval files can be indexed with `tabix` to allow random access.
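For example, a sketch of a task that indexes a BED-formatted interval file (not part of the committed file) could be:

```python
@transform('*.bed', suffix('.bed'), '.bed.gz.tbi')
def indexIntervals(infile, outfile):
    # Compress with bgzip and build a tabix index so regions can be fetched randomly
    statement = '''bgzip -c %(infile)s > %(infile)s.gz &&
                   tabix -p bed %(infile)s.gz'''
    P.run(statement)
```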
## Import statements

To run our pipelines, you need to import the CGAT-core Python modules into your pipeline. We recommend importing the following modules for every CGAT pipeline:

```python
from ruffus import *
import cgatcore.experiment as E
from cgatcore import pipeline as P
import cgatcore.iotools as iotools
```

Additional modules can be imported as needed.

## Selecting the appropriate Ruffus decorator

Before starting a pipeline, it is helpful to map out the steps and flow of your potential pipeline on a whiteboard. This helps identify the inputs and outputs of each task. Once you have a clear picture, determine which Ruffus decorator to use for each task. For more information on each decorator, refer to the [Ruffus documentation](http://www.ruffus.org.uk/decorators/decorators.html).

## Running commands within tasks

To run a command line program within a pipeline task, build a statement and call the `P.run()` method:

```python
@transform('*.unsorted', suffix('.unsorted'), '.sorted')
def sortFile(infile, outfile):
    statement = '''sort %(infile)s > %(outfile)s'''
    P.run(statement)
```

In the `P.run()` method, the environment of the caller is examined for a variable called `statement`, which is then subjected to string substitution from other variables in the local namespace. In the example above, `%(infile)s` and `%(outfile)s` are replaced with the values of `infile` and `outfile`, respectively.

The same mechanism also allows configuration parameters to be set, as shown here:

```python
@transform('*.unsorted', suffix('.unsorted'), '.sorted')
def sortFile(infile, outfile):
    statement = '''sort -T %(tmpdir)s %(infile)s > %(outfile)s'''
    P.run(statement)
```

In this case, the configuration parameter `tmpdir` is substituted into the command.

### Chaining commands with error checking

If you need to chain multiple commands, you can use `&&` to ensure that errors in upstream commands are detected:

```python
@transform('*.unsorted.gz', suffix('.unsorted.gz'), '.sorted')
def sortFile(infile, outfile):
    statement = '''gunzip < %(infile)s > %(infile)s.tmp &&
                   sort -T %(tmpdir)s %(infile)s.tmp > %(outfile)s &&
                   rm -f %(infile)s.tmp'''
    P.run(statement)
```

Alternatively, you can achieve this more efficiently using pipes:

```python
@transform('*.unsorted.gz', suffix('.unsorted.gz'), '.sorted.gz')
def sortFile(infile, outfile):
    statement = '''gunzip < %(infile)s | sort -T %(tmpdir)s | gzip > %(outfile)s'''
    P.run(statement)
```

The pipeline automatically inserts code to check for error return codes when multiple commands are combined in a pipe.
## Running commands on a cluster

To run commands on a cluster, set `to_cluster=True`:

```python
@transform('*.unsorted.gz', suffix('.unsorted.gz'), '.sorted.gz')
def sortFile(infile, outfile):
    to_cluster = True
    statement = '''gunzip < %(infile)s | sort -T %(tmpdir)s | gzip > %(outfile)s'''
    P.run(statement)
```

Pipelines will use command line options such as `--cluster-queue` and `--cluster-priority` for global job control. For instance, to change the priority when starting the pipeline:

```bash
python <pipeline_script.py> --cluster-priority=-20
```

To set job-specific options, you can define additional variables:

```python
@transform('*.unsorted.gz', suffix('.unsorted.gz'), '.sorted.gz')
def sortFile(infile, outfile):
    to_cluster = True
    job_queue = 'longjobs.q'
    job_priority = -10
    job_options = "-pe dedicated 4 -R y"
    statement = '''gunzip < %(infile)s | sort -T %(tmpdir)s | gzip > %(outfile)s'''
    P.run(statement)
```

The statement above will run in the queue `longjobs.q` with a priority of `-10`. It will also be executed in the parallel environment `dedicated`, using at least four cores.
## Combining commands

To combine commands, use `&&` to ensure they execute in the intended order:

```python
statement = """
module load cutadapt &&
cutadapt ...
"""

P.run(statement)
```

Without `&&`, the command would fail because the `cutadapt` command would execute as part of the `module load` statement.
## Useful information regarding decorators

For a full list of Ruffus decorators that control pipeline flow, see the [Ruffus documentation](http://www.ruffus.org.uk/decorators/decorators.html).

Here are some examples of modifying an input file name to transform it into the output filename:

### Using Suffix

```python
@transform(pairs, suffix('.fastq.gz'), ('_trimmed.fastq.gz', '_trimmed.fastq.gz'))
```

This will transform an input `<name_of_file>.fastq.gz` into an output `<name_of_file>_trimmed.fastq.gz`.

### Using Regex

```python
@follows(mkdir("new_folder.dir"))
@transform(pairs, regex(r'(\S+).fastq.gz'), (r'new_folder.dir/\1_trimmed.fastq.gz', r'new_folder.dir/\1_trimmed.fastq.gz'))
```

### Using Formatter

```python
@follows(mkdir("new_folder.dir"))
@transform(pairs, formatter(r'(\S+).fastq.gz'), ('new_folder.dir/{SAMPLE[0]}_trimmed.fastq.gz', 'new_folder.dir/{SAMPLE[0]}_trimmed.fastq.gz'))
```

This documentation aims to provide a comprehensive guide to writing your own workflows and pipelines. For more advanced usage, please refer to the original CGAT-core and Ruffus documentation.
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Cluster configuration

Currently, cgatcore supports the following workload managers: SGE, SLURM, Torque, and PBSPro. The default cluster options are set for SunGrid Engine (SGE). If you are using a different workload manager, you need to configure your cluster settings accordingly by creating a `.cgat.yml` file in your home directory.

This configuration file allows you to override the default settings. To view the hardcoded parameters for cgatcore, refer to the [parameters.py file](https://github.com/cgat-developers/cgat-core/blob/eb6d29e5fe1439de2318aeb5cdfa730f36ec3af4/cgatcore/pipeline/parameters.py#L67).

For an example of configuring a PBSPro workload manager, see the provided [config example](https://github.com/AntonioJBT/pipeline_example/blob/master/Docker_and_config_file_examples/cgat.yml).

The `.cgat.yml` file in your home directory will take precedence over the default cgatcore settings. For instance, adding the following configuration to `.cgat.yml` will implement cluster settings for PBSPro:

```yaml
memory_resource: mem
options: -l walltime=00:10:00 -l select=1:ncpus=8:mem=1gb
queue_manager: pbspro
queue: NONE
parallel_environment: "dedicated"
```

This setup specifies memory resource allocation (`mem`), runtime limits (`walltime`), selection of CPU and memory resources, and the use of the PBSPro queue manager, among other settings. Make sure to adjust the parameters according to your cluster environment to optimise the workload manager for your pipeline runs.
