Commit
Merge pull request #184 from cgat-developers/AC-docs
updated mkdocs to work with dynamin importing documentation
Showing 13 changed files with 379 additions and 446 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,243 @@ | ||
# Writing a workflow - Tutorial | ||
|
||
The explicit aim of cgat-core is to allow users to quickly and easily build their own computational pipelines and so speed up their analysis workflows. | ||
|
||
## Installation of cgat-core | ||
|
||
In order to begin writing a pipeline, you will need to install the cgat-core code (see installation instructions in the "Getting Started" section). | ||
|
||
## Tutorial start | ||
|
||
### Setting up the pipeline | ||
|
||
**1.** First, navigate to a directory where you want to start building your code: | ||
|
||
```bash | ||
mkdir test && cd test && mkdir configuration && touch configuration/pipeline.yml && touch pipeline_test.py && touch ModuleTest.py | ||
``` | ||
|
||
This command will create a directory called `test` in the current directory with the following layout: | ||
|
||
``` | ||
|-- configuration | ||
| \-- pipeline.yml | ||
|-- pipeline_test.py | ||
|-- ModuleTest.py | ||
``` | ||
|
||
The layout has the following components: | ||
|
||
- **pipeline_test.py**: This is the file that will contain all of the ruffus workflows. The file needs to be named in the format `pipeline_<name>.py`. | ||
- **configuration/**: Directory containing the `pipeline.yml` configuration file. Note that the config parser shown below looks for `pipeline.yml` in a directory named after the pipeline script without its `.py` extension (here `pipeline_test/`), in the parent directory, and in the working directory. | ||
- **ModuleTest.py**: This file will contain functions that will be imported into the main ruffus workflow file (`pipeline_test.py`). | ||
|
||
**2.** Open `pipeline_test.py` and add the source code | ||
|
||
This is where the ruffus tasks will be written. The code begins with a docstring detailing the pipeline functionality. You should use this section to document your pipeline: | ||
|
||
```python | ||
'''This pipeline is a test and this is where the documentation goes ''' | ||
``` | ||
|
||
The pipeline then needs a few utility functions to help with executing the pipeline. | ||
|
||
- **Import statements**: You will need to import ruffus and cgatcore utilities: | ||
|
||
```python | ||
from ruffus import * | ||
import cgatcore.experiment as E | ||
from cgatcore import pipeline as P | ||
``` | ||
|
||
Importing `ruffus` allows ruffus decorators to be used within the pipeline. | ||
Importing `experiment` from `cgatcore` provides utility functions for argument parsing, logging, and record-keeping within scripts. | ||
Importing `pipeline` from `cgatcore` provides utility functions for interfacing CGAT ruffus pipelines with an HPC cluster, uploading data to a database, and parameterisation. | ||
|
||
You'll also need some Python modules: | ||
|
||
```python | ||
import os | ||
import sqlite3 | ||
import sys | ||
``` | ||
|
||
- **Config parser**: This code helps with parsing the `pipeline.yml` file: | ||
|
||
```python | ||
# Load options from the config file | ||
PARAMS = P.get_parameters([ | ||
"%s/pipeline.yml" % os.path.splitext(__file__)[0], | ||
"../pipeline.yml", | ||
"pipeline.yml"]) | ||
``` | ||
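To see which locations this list resolves to, the sketch below rebuilds the three candidate paths for a hypothetical script location. The path `/code/pipeline_test.py` and the helper name `candidate_configs` are illustrative assumptions, not part of cgat-core.

```python
import os


def candidate_configs(script_path):
    """Return the config locations listed in the snippet above, in the same order."""
    # Strip the ".py" extension: /code/pipeline_test.py -> /code/pipeline_test
    script_stem = os.path.splitext(script_path)[0]
    return [
        "%s/pipeline.yml" % script_stem,  # directory named after the script
        "../pipeline.yml",                # parent of the working directory
        "pipeline.yml",                   # the working directory itself
    ]


if __name__ == "__main__":
    for path in candidate_configs("/code/pipeline_test.py"):
        print(path)
```

Printing the list for your own script path is a quick way to check where a `pipeline.yml` would need to live to be picked up.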
|
||
- **Pipeline configuration**: We will add configurable variables to our `pipeline.yml` file so that we can modify the output of our pipeline. Open `pipeline.yml` and add the following: | ||
|
||
```yaml | ||
database: | ||
    name: "csvdb" | ||
``` | ||
When you run the pipeline, configuration values are accessed by joining the section and option names with an underscore; in this case the value `csvdb` is available as `PARAMS["database_name"]`. | ||
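This naming can be pictured as flattening the nested YAML mapping into a single dictionary. The sketch below only illustrates that naming rule with a plain dictionary; `flatten` is a hypothetical helper, not the cgat-core implementation.

```python
def flatten(config, prefix=""):
    """Flatten nested dicts so {"database": {"name": x}} becomes {"database_name": x}."""
    flat = {}
    for key, value in config.items():
        name = prefix + key
        if isinstance(value, dict):
            # Recurse, joining section and option names with an underscore
            flat.update(flatten(value, prefix=name + "_"))
        else:
            flat[name] = value
    return flat


if __name__ == "__main__":
    # Equivalent of the YAML shown above, written as a Python dict
    config = {"database": {"name": "csvdb"}}
    print(flatten(config)["database_name"])
```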
|
||
- **Database connection**: This code helps with connecting to an SQLite database: | ||
|
||
```python | ||
def connect(): | ||
    '''Utility function to connect to the database. | ||
    Use this method to connect to the pipeline database. | ||
    Additional databases can be attached here as well. | ||
    Returns an sqlite3 database handle. | ||
    ''' | ||
    dbh = sqlite3.connect(PARAMS["database_name"]) | ||
    return dbh | ||
``` | ||
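Once a handle is available, it can be queried with the standard `sqlite3` API. The sketch below uses an in-memory database and a made-up `word_counts` table so that it runs on its own; in the pipeline you would call `connect()` instead and query whichever tables your tasks have loaded.

```python
import sqlite3


def top_words(dbh, limit=3):
    """Return the most frequent (word, freq) rows from a hypothetical word_counts table."""
    cursor = dbh.execute(
        "SELECT word, freq FROM word_counts ORDER BY freq DESC, word ASC LIMIT ?",
        (limit,))
    return cursor.fetchall()


def demo_handle():
    """Build an in-memory database standing in for the pipeline database."""
    dbh = sqlite3.connect(":memory:")
    dbh.execute("CREATE TABLE word_counts (word TEXT, freq INTEGER)")
    dbh.executemany("INSERT INTO word_counts VALUES (?, ?)",
                    [("name:", 1), ("database:", 1), ("csvdb", 2)])
    return dbh


if __name__ == "__main__":
    print(top_words(demo_handle(), limit=1))
```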
|
||
- **Commandline parser**: This code allows the pipeline to parse arguments: | ||
|
||
```python | ||
def main(argv=None): | ||
    if argv is None: | ||
        argv = sys.argv | ||
    # Return the exit status so that it can be passed on to sys.exit() | ||
    return P.main(argv) | ||
if __name__ == "__main__": | ||
    sys.exit(main(sys.argv)) | ||
``` | ||
|
||
### Running the test pipeline | ||
|
||
You now have the bare bones layout of the pipeline, and you need some code to execute. Below is example code that you can copy and paste into your `pipeline_test.py` file. The code includes two ruffus `@transform` tasks that parse `pipeline.yml`. The first function, called `countWords`, contains a statement that counts the number of words in the file. The statement is then executed using the `P.run()` function. | ||
|
||
The second ruffus `@transform` function called `loadWordCounts` takes as input the output of the function `countWords` and loads the number of words into an SQLite database using `P.load()`. | ||
|
||
The third function, `full()`, is a dummy task that runs the entire pipeline. It has an `@follows` decorator that takes the `loadWordCounts` function, completing the pipeline chain. | ||
|
||
The following code should be pasted just before the **Commandline parser** arguments and after the **database connection** code: | ||
|
||
```python | ||
# --------------------------------------------------- | ||
# Specific pipeline tasks | ||
@transform("pipeline.yml", | ||
           regex(r"(.*)\.(.*)"), | ||
           r"\1.counts") | ||
def countWords(infile, outfile): | ||
    '''Count the number of words in the pipeline configuration files.''' | ||
    # The command line statement we want to execute. | ||
    # Backslashes are doubled so that awk, not Python, interprets \t and \n, | ||
    # and %% keeps a literal % when the %(name)s placeholders are filled in. | ||
    statement = '''awk 'BEGIN { printf("word\\tfreq\\n"); } | ||
    {for (i = 1; i <= NF; i++) freq[$i]++} | ||
    END { for (word in freq) printf "%%s\\t%%d\\n", word, freq[word] }' | ||
    < %(infile)s > %(outfile)s''' | ||
    # Execute the command in the variable statement. | ||
    P.run(statement) | ||
@transform(countWords, | ||
           suffix(".counts"), | ||
           "_counts.load") | ||
def loadWordCounts(infile, outfile): | ||
    '''Load results of word counting into database.''' | ||
    P.load(infile, outfile, "--add-index=word") | ||
# --------------------------------------------------- | ||
# Generic pipeline tasks | ||
@follows(loadWordCounts) | ||
def full(): | ||
    pass | ||
``` | ||
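The two filename rules used by these tasks can be tried out without ruffus. The sketch below approximates `regex(...)` with `re.sub` and `suffix(...)` with a replacement at the end of the name. It is meant only to show which output names to expect, and the helper names are hypothetical.

```python
import re


def apply_regex(infile, pattern, replacement):
    """Approximate a regex() filter: substitute the matched part of the input name."""
    return re.sub(pattern, replacement, infile)


def apply_suffix(infile, suffix, replacement):
    """Approximate a suffix() filter: swap a trailing suffix for the replacement."""
    if not infile.endswith(suffix):
        raise ValueError("%s does not end with %s" % (infile, suffix))
    return infile[:-len(suffix)] + replacement


if __name__ == "__main__":
    counts = apply_regex("pipeline.yml", r"(.*)\.(.*)", r"\1.counts")
    loaded = apply_suffix(counts, ".counts", "_counts.load")
    print(counts, loaded)
```

Under these assumptions, `pipeline.yml` leads to `pipeline.counts`, which in turn leads to `pipeline_counts.load`.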
|
||
To run the pipeline, navigate to the working directory and then run the pipeline: | ||
|
||
```bash | ||
python /location/to/code/pipeline_test.py config | ||
python /location/to/code/pipeline_test.py show full -v 5 | ||
``` | ||
|
||
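The `config` command places `pipeline.yml` in the working directory, and `show full` lists the tasks needed to build the `full` target without running them. Then run: | ||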
|
||
```bash | ||
python /location/to/code/pipeline_test.py make full -v5 --local | ||
``` | ||
|
||
The pipeline will then execute and count the words in the `yml` file. | ||
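For reference, the counting done by the `awk` statement can be written in a few lines of Python. This is a sketch of the same logic (split each line on whitespace, tally every token, write a `word`/`freq` table), not a replacement for the pipeline task; the function names are hypothetical.

```python
from collections import Counter


def count_words(lines):
    """Tally whitespace-separated tokens across all lines, as the awk loop does."""
    freq = Counter()
    for line in lines:
        freq.update(line.split())
    return freq


def format_counts(freq):
    """Render the tally as tab-separated text with a header row."""
    rows = ["word\tfreq"]
    # Sort for reproducible output; awk's "for (word in freq)" order is unspecified
    rows.extend("%s\t%d" % (word, n) for word, n in sorted(freq.items()))
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    yaml_lines = ["database:", '    name: "csvdb"']
    print(format_counts(count_words(yaml_lines)), end="")
```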
|
||
### Modifying the test pipeline to build your own workflows | ||
|
||
The next step is to modify the basic code in the pipeline to fit your particular NGS workflow needs. For example, suppose you want to convert a SAM file into a BAM file, then perform flag stats on that output BAM file. The code and layout that we just wrote can be easily modified to perform this. | ||
|
||
The pipeline will have two steps: | ||
1. Identify all SAM files and convert them to BAM files. | ||
2. Take the output of step 1 and perform flag stats on that BAM file. | ||
|
||
The first step would involve writing a function to identify all `sam` files in a `data.dir/` directory and convert them to BAM files using `samtools view`. The second function would then take the output of the first function, perform `samtools flagstat`, and output the results as a flat `.txt` file. This would be written as follows: | ||
|
||
```python | ||
@transform("data.dir/*.sam", | ||
           regex(r"data.dir/(\S+)\.sam"), | ||
           r"\1.bam") | ||
def bamConvert(infile, outfile): | ||
    '''Convert a SAM file into a BAM file using samtools view.''' | ||
    statement = '''samtools view -bT /ifs/mirror/genomes/plain/hg19.fasta \ | ||
    %(infile)s > %(outfile)s''' | ||
    P.run(statement) | ||
@transform(bamConvert, | ||
           suffix(".bam"), | ||
           "_flagstats.txt") | ||
def bamFlagstats(infile, outfile): | ||
    '''Perform flagstats on a BAM file.''' | ||
    statement = '''samtools flagstat %(infile)s > %(outfile)s''' | ||
    P.run(statement) | ||
``` | ||
|
||
To run the pipeline: | ||
|
||
```bash | ||
python /path/to/file/pipeline_test.py make full -v5 | ||
``` | ||
|
||
The BAM files and flagstats outputs should be generated, provided the `@follows` decorator of the `full` task has been updated to depend on `bamFlagstats`. | ||
|
||
### Parameterising the code using the `.yml` file | ||
|
||
As a philosophy, we try to avoid hardcoded parameters, so that any variable can be easily modified by the user without changing the code. | ||
|
||
Looking at the code above, the hardcoded link to the `hg19.fasta` file can be added as a customisable parameter, allowing users to specify any FASTA file depending on the genome build used. In the `pipeline.yml`, add: | ||
|
||
```yaml | ||
genome: | ||
    fasta: /ifs/mirror/genomes/plain/hg19.fasta | ||
``` | ||
|
||
In the `pipeline_test.py` code, the value can be accessed via `PARAMS["genome_fasta"]`. | ||
Therefore, the code for parsing BAM files can be modified as follows: | ||
|
||
```python | ||
@transform("data.dir/*.sam", | ||
           regex(r"data.dir/(\S+)\.sam"), | ||
           r"\1.bam") | ||
def bamConvert(infile, outfile): | ||
    '''Convert a SAM file into a BAM file using samtools view.''' | ||
    genome_fasta = PARAMS["genome_fasta"] | ||
    statement = '''samtools view -bT %(genome_fasta)s \ | ||
    %(infile)s > %(outfile)s''' | ||
    P.run(statement) | ||
@transform(bamConvert, | ||
           suffix(".bam"), | ||
           "_flagstats.txt") | ||
def bamFlagstats(infile, outfile): | ||
    '''Perform flagstats on a BAM file.''' | ||
    statement = '''samtools flagstat %(infile)s > %(outfile)s''' | ||
    P.run(statement) | ||
``` | ||
|
||
Running the code again will generate the same output. However, if you had SAM files that came from a different genome build, you could simply change the parameter in the `yml` file, delete the output files, and run the pipeline again with the new configuration values. | ||
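The `%(name)s` placeholders follow Python's standard dictionary-based string formatting. The sketch below mimics that substitution by hand to show the command that results. How `P.run()` gathers the values internally is not shown in this tutorial, so the `build_statement` helper and the file names are assumptions for illustration.

```python
def build_statement(template, params, **task_values):
    """Fill %(name)s placeholders from config values plus per-task values."""
    values = dict(params)
    values.update(task_values)  # per-task values take precedence over config values
    return template % values


if __name__ == "__main__":
    params = {"genome_fasta": "/ifs/mirror/genomes/plain/hg19.fasta"}
    template = "samtools view -bT %(genome_fasta)s %(infile)s > %(outfile)s"
    print(build_statement(template, params,
                          infile="data.dir/sample1.sam",
                          outfile="sample1.bam"))
```

Changing only the `genome_fasta` entry changes the generated command, which is the point of keeping the path in `pipeline.yml` rather than in the task.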
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# CGATcore CSV2DB Module | ||
|
||
::: cgatcore.csv2db | ||
:members: | ||
:show-inheritance: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# CGATcore Database Module | ||
|
||
::: cgatcore.database | ||
:members: | ||
:show-inheritance: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# CGATcore Experiment Module | ||
|
||
::: cgatcore.experiment | ||
:members: | ||
:show-inheritance: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# CGATcore IOTools Module | ||
|
||
::: cgatcore.iotools | ||
:members: | ||
:show-inheritance: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# CGATcore Logfile Module | ||
|
||
::: cgatcore.logfile | ||
:members: | ||
:show-inheritance: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# CGATcore Pipeline Module | ||
|
||
::: cgatcore.pipeline | ||
:members: | ||
:show-inheritance: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
# Installation | ||
|
||
The following sections describe how to install the `cgatcore` framework. | ||
|
||
## Conda installation | ||
|
||
The preferred method of installation is using Conda. If you do not have Conda installed, you can install it using [Miniconda](https://conda.io/miniconda.html) or [Anaconda](https://www.anaconda.com/download/#macos). | ||
|
||
`cgatcore` is installed via the Bioconda channel, and the recipe can be found on [GitHub](https://github.com/bioconda/bioconda-recipes/tree/b1a943da5a73b4c3fad93fdf281915b397401908/recipes/cgat-core). To install `cgatcore`, run the following command: | ||
|
||
```bash | ||
conda install -c conda-forge -c bioconda cgatcore | ||
``` | ||
|
||
## Pip installation | ||
|
||
We recommend installation through Conda because it manages dependencies automatically. However, `cgatcore` is generally lightweight and can also be installed using the `pip` package manager, in which case you may need to install some dependencies manually: | ||
|
||
```bash | ||
pip install cgatcore | ||
``` | ||
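After either installation route, a quick check that the package is importable can save time later. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(package):
    """Return True if the named top-level package can be found on the import path."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    for name in ("cgatcore", "ruffus"):
        status = "found" if is_installed(name) else "missing"
        print("%s: %s" % (name, status))
```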
|
||
## Automated installation | ||
|
||
The preferred method to install `cgatcore` is using Conda. However, we have also created a Bash installation script, which uses [Conda](https://conda.io/docs/) under the hood. | ||
|
||
Here are the steps: | ||
|
||
```bash | ||
# Download the installation script: | ||
curl -O https://raw.githubusercontent.com/cgat-developers/cgat-core/master/install.sh | ||
|
||
# See help: | ||
bash install.sh | ||
|
||
# Install the development version (recommended, as there is no production version yet): | ||
bash install.sh --devel [--location </full/path/to/folder/without/trailing/slash>] | ||
|
||
# To download the code in Git format instead of the default zip format, add one of these options: | ||
#   --git      for an HTTPS clone | ||
#   --git-ssh  for an SSH clone (you need to be a cgat-developer contributor on GitHub to do this) | ||
|
||
# Enable the Conda environment as instructed by the installation script | ||
# Note: you might want to automate this by adding the following instructions to your .bashrc | ||
source </full/path/to/folder/without/trailing/slash>/conda-install/etc/profile.d/conda.sh | ||
conda activate base | ||
conda activate cgat-c | ||
``` | ||
|
||
The installation script will place everything under the specified location. The aim of the script is to provide a portable installation that does not interfere with existing software environments. As a result, you will have a dedicated Conda environment that can be activated as needed to work with `cgatcore`. | ||
|
||
## Manual installation | ||
|
||
To obtain the latest code, clone the public Git repository and install it in development mode: | ||
|
||
```bash | ||
git clone https://github.com/cgat-developers/cgat-core.git | ||
cd cgat-core | ||
python setup.py develop | ||
``` | ||
|
||
To update to the latest version, simply pull the latest changes: | ||
|
||
```bash | ||
git pull | ||
``` | ||
|
||
## Installing additional software | ||
|
||
When building your own workflows, we recommend using Conda to install software into your environment where possible. This ensures compatibility and ease of installation. | ||
|
||
To search for and install a package using Conda: | ||
|
||
```bash | ||
conda search <package> | ||
conda install <package> | ||
``` | ||
|
||
## Accessing the libdrmaa shared library | ||
|
||
You may also need access to the `libdrmaa.so.1.0` C library, which can often be installed as part of the `libdrmaa-dev` package on most Unix systems. Once installed, you may need to specify the location of the DRMAA library if it is not in a default library path. Set the `DRMAA_LIBRARY_PATH` environment variable to point to the library location. | ||
|
||
To set this variable permanently, add the following line to your `.bashrc` file (adjusting the path as necessary): | ||
|
||
```bash | ||
export DRMAA_LIBRARY_PATH=/usr/lib/libdrmaa.so.1.0 | ||
``` | ||
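Before submitting jobs to a cluster, it can help to confirm that the variable is set and points at a real file. The following is a minimal sketch using only the standard library; the helper name `drmaa_library_status` and its return values are assumptions made for this example.

```python
import os


def drmaa_library_status(environ=None):
    """Describe whether DRMAA_LIBRARY_PATH is set and points at an existing file."""
    environ = os.environ if environ is None else environ
    path = environ.get("DRMAA_LIBRARY_PATH")
    if not path:
        return "unset"
    return "ok" if os.path.isfile(path) else "missing"


if __name__ == "__main__":
    print("DRMAA_LIBRARY_PATH:", drmaa_library_status())
```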
|
||
[Conda documentation](https://conda.io) | ||
|