# Writing a workflow - Tutorial

The explicit aim of cgat-core is to allow users to quickly and easily build their own computational pipelines, speeding up their analysis workflows.

## Installation of cgat-core

In order to begin writing a pipeline, you will need to install the cgat-core code (see the installation instructions in the "Getting Started" section).

## Tutorial start

### Setting up the pipeline

**1.** First, navigate to a directory where you want to start building your code:

```bash
mkdir test && cd test && mkdir configuration && touch configuration/pipeline.yml && touch pipeline_test.py && touch ModuleTest.py
```

This command will create a directory called `test` in the current directory with the following layout:

```
|-- configuration
|   \-- pipeline.yml
|-- pipeline_test.py
|-- ModuleTest.py
```

The layout has the following components:

- **pipeline_test.py**: This is the file that will contain all of the ruffus workflows. The file needs to be named in the format `pipeline_<name>.py`.
- **configuration/**: Directory containing the `pipeline.yml` configuration file. Note that the config parser shown below also searches for `pipeline_test/pipeline.yml` next to the script, i.e. a directory named after the `pipeline_<name>.py` file; either way, the `config` command described later places a working copy of `pipeline.yml` in the run directory.
- **ModuleTest.py**: This file will contain functions that will be imported into the main ruffus workflow file (`pipeline_test.py`).

**2.** View the source code within `pipeline_test.py`.

This is where the ruffus tasks will be written. The code begins with a docstring detailing the pipeline functionality; use this section to document your pipeline:

```python
'''This pipeline is a test and this is where the documentation goes.'''
```

The pipeline then needs a few utility functions to help with executing the pipeline.

- **Import statements**: You will need to import ruffus and cgatcore utilities:

```python
from ruffus import *
import cgatcore.experiment as E
from cgatcore import pipeline as P
```

Importing `ruffus` allows ruffus decorators to be used within the pipeline.
Importing `experiment` from `cgatcore` provides utility functions for argument parsing, logging, and record-keeping within scripts.
Importing `pipeline` from `cgatcore` provides utility functions for interfacing CGAT ruffus pipelines with an HPC cluster, uploading data to a database, and parameterisation.

You'll also need some standard Python modules:

```python
import os
import sys
import sqlite3  # needed by the connect() utility defined below
```

- **Config parser**: This code helps with parsing the `pipeline.yml` file:

```python
# Load options from the config file
PARAMS = P.get_parameters([
    "%s/pipeline.yml" % os.path.splitext(__file__)[0],
    "../pipeline.yml",
    "pipeline.yml"])
```

- **Pipeline configuration**: We will add configurable variables to our `pipeline.yml` file so that we can modify the output of our pipeline. Open `pipeline.yml` and add the following:

```yaml
database:
    name: "csvdb"
```

When you run the pipeline, the configuration values can be accessed through `PARAMS`: nested YAML keys are flattened into underscore-joined keys, so `database: name:` becomes `PARAMS["database_name"]` (here, `"csvdb"`).
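
As a quick sketch (assuming the `pipeline.yml` above has been loaded by `P.get_parameters()`), `PARAMS` can be read like an ordinary dictionary:

```python
# PARAMS behaves like a dictionary; the nested key database -> name
# has been flattened to "database_name" by P.get_parameters().
db_name = PARAMS["database_name"]
print(db_name)  # csvdb
```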

- **Database connection**: This code helps with connecting to an SQLite database:

```python
def connect():
    '''Utility function to connect to the database.

    Use this method to connect to the pipeline database.
    Additional databases can be attached here as well.

    Returns an sqlite3 database handle.
    '''
    dbh = sqlite3.connect(PARAMS["database_name"])
    return dbh
```
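
As a usage sketch (assuming the database has already been populated, for example by the `P.load()` task added below), the handle behaves like any `sqlite3` connection:

```python
dbh = connect()
# list the tables currently present in the pipeline database
tables = dbh.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)
dbh.close()
```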

- **Commandline parser**: This code allows the pipeline to parse arguments:

```python
def main(argv=None):
    if argv is None:
        argv = sys.argv
    return P.main(argv)


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```
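
With this entry point in place, the pipeline picks up cgat-core's standard command-line interface (the `config`, `show`, and `make` commands used below). For example:

```bash
# list the available commands and options
python pipeline_test.py --help
```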

### Running the test pipeline

You now have the bare-bones layout of the pipeline, and you need some code to execute. Below is example code that you can copy and paste into your `pipeline_test.py` file. The code includes two ruffus `@transform` tasks and a final `@follows` task. The first function, `countWords`, contains a statement that counts the number of words in the `pipeline.yml` file; the statement is executed using the `P.run()` function.

The second ruffus `@transform` function, `loadWordCounts`, takes as input the output of `countWords` and loads the word counts into an SQLite database using `P.load()`.

The third function, `full()`, is a dummy task that runs the entire pipeline. Its `@follows` decorator takes the `loadWordCounts` function, completing the pipeline chain.

The following code should be pasted after the **database connection** code and just before the **commandline parser**:

```python
# ---------------------------------------------------
# Specific pipeline tasks
@transform("pipeline.yml",
           regex(r"(.*)\.(.*)"),
           r"\1.counts")
def countWords(infile, outfile):
    '''Count the number of words in the pipeline configuration file.'''

    # The command line statement we want to execute.
    # Percent signs are doubled (%%) because P.run() interpolates the
    # statement with Python %-formatting, and \t/\n are escaped so that
    # they reach awk as literal escape sequences.
    statement = '''awk 'BEGIN { printf("word\\tfreq\\n"); }
    {for (i = 1; i <= NF; i++) freq[$i]++}
    END { for (word in freq) printf "%%s\\t%%d\\n", word, freq[word] }'
    < %(infile)s > %(outfile)s'''

    # Execute the command in the variable statement.
    P.run(statement)


@transform(countWords,
           suffix(".counts"),
           "_counts.load")
def loadWordCounts(infile, outfile):
    '''Load results of word counting into database.'''
    P.load(infile, outfile, "--add-index=word")


# ---------------------------------------------------
# Generic pipeline tasks
@follows(loadWordCounts)
def full():
    pass
```
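
A note on the statement syntax: `P.run()` interpolates the statement with Python %-formatting, resolving placeholders such as `%(infile)s` and `%(outfile)s` from the calling function's local variables (and from `PARAMS`). A minimal sketch of that mechanism in plain Python (the file names are hypothetical):

```python
# Stand-in illustration of the interpolation step performed by P.run():
# %(name)s placeholders are filled in from a dictionary of local
# variables before the command is handed to the shell or the cluster.
infile, outfile = "pipeline.yml", "pipeline.counts"
statement = "wc -w < %(infile)s > %(outfile)s"
print(statement % locals())  # wc -w < pipeline.yml > pipeline.counts
```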

To run the pipeline, navigate to the working directory and then run:

```bash
python /location/to/code/pipeline_test.py config
python /location/to/code/pipeline_test.py show full -v 5
```

The `config` command places a copy of `pipeline.yml` in the working directory, and `show full` displays the tasks that will be run. Then execute the pipeline:

```bash
python /location/to/code/pipeline_test.py make full -v5 --local
```

The pipeline will then execute and count the words in the `yml` file.
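
If all went well, a `pipeline.counts` file should appear in the working directory. Its exact contents depend on your configuration file, but the format produced by the awk statement is a tab-separated table along these lines (illustrative only):

```
word        freq
database:   1
name:       1
"csvdb"     1
```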

### Modifying the test pipeline to build your own workflows

The next step is to modify the basic code in the pipeline to fit your particular NGS workflow needs. For example, suppose you want to convert a SAM file into a BAM file, then perform flag stats on that output BAM file. The code and layout that we just wrote can easily be modified to do this.

The pipeline will have two steps:
1. Identify all SAM files and convert them to BAM files.
2. Take the output of step 1 and perform flag stats on that BAM file.

The first step would involve writing a function to identify all `.sam` files in a `data.dir/` directory and convert them to BAM files using `samtools view`. The second function would then take the output of the first function, perform `samtools flagstat`, and output the results as a flat `.txt` file. This would be written as follows:

```python
@transform("data.dir/*.sam",
           regex(r"data.dir/(\S+)\.sam"),
           r"\1.bam")
def bamConvert(infile, outfile):
    '''Convert a SAM file into a BAM file using samtools view.'''

    statement = '''samtools view -bT /ifs/mirror/genomes/plain/hg19.fasta \
    %(infile)s > %(outfile)s'''
    P.run(statement)


@transform(bamConvert,
           suffix(".bam"),
           "_flagstats.txt")
def bamFlagstats(infile, outfile):
    '''Perform flagstats on a BAM file.'''

    statement = '''samtools flagstat %(infile)s > %(outfile)s'''
    P.run(statement)
```
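
Under this naming scheme, a file such as `data.dir/sample1.sam` (the sample name here is just for illustration) would flow through the pipeline as:

```
data.dir/sample1.sam  ->  sample1.bam  ->  sample1_flagstats.txt
```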

To run the pipeline:

```bash
python /path/to/file/pipeline_test.py make full -v5
```

The BAM files and flagstats outputs should be generated.

### Parameterising the code using the `.yml` file

As a philosophy, we try to avoid any hardcoded parameters, so that any variables can easily be modified by the user without changing the code.

Looking at the code above, the hardcoded path to the `hg19.fasta` file can be made a customisable parameter, allowing users to specify any FASTA file depending on the genome build used. In the `pipeline.yml`, add:

```yaml
genome:
    fasta: /ifs/mirror/genomes/plain/hg19.fasta
```

In the `pipeline_test.py` code, the value can then be accessed via `PARAMS["genome_fasta"]`.
Therefore, the SAM-to-BAM conversion code can be modified as follows:

```python
@transform("data.dir/*.sam",
           regex(r"data.dir/(\S+)\.sam"),
           r"\1.bam")
def bamConvert(infile, outfile):
    '''Convert a SAM file into a BAM file using samtools view.'''

    genome_fasta = PARAMS["genome_fasta"]

    statement = '''samtools view -bT %(genome_fasta)s \
    %(infile)s > %(outfile)s'''
    P.run(statement)


@transform(bamConvert,
           suffix(".bam"),
           "_flagstats.txt")
def bamFlagstats(infile, outfile):
    '''Perform flagstats on a BAM file.'''

    statement = '''samtools flagstat %(infile)s > %(outfile)s'''
    P.run(statement)
```

Running the code again will generate the same output. However, if you had BAM files that came from a different genome build, you can simply modify the parameter in the `.yml` file, delete the old output files, and run the pipeline again with the new configuration values.
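
For example, switching to a different build is a one-line edit (the `hg38` path below is hypothetical; substitute the location of your own reference):

```yaml
genome:
    fasta: /path/to/genomes/hg38.fasta
```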
| 243 | + |