
Commit 55ad2cf

Merge pull request #194 from cgat-developers/AC-document

Ac document

2 parents 145bf4e + 70fea88

File tree

13 files changed: +1805 −120 lines changed

cgatcore/pipeline/__init__.py

Lines changed: 47 additions & 3 deletions

@@ -5,9 +5,53 @@
 =============================================

 This module provides a comprehensive set of tools to facilitate the creation and management
-of data processing pipelines using CGAT Ruffus. It includes functionalities for pipeline control,
-logging, parameterization, task execution, database uploads, temporary file management, and
-integration with AWS S3.
+of data processing pipelines using CGAT Ruffus. It includes functionalities for:
+
+1. Pipeline Control
+   - Task execution and dependency management
+   - Command-line interface for pipeline operations
+   - Logging and error handling
+
+2. Resource Management
+   - Cluster job submission and monitoring
+   - Memory and CPU allocation
+   - Temporary file handling
+
+3. Configuration
+   - Parameter management via YAML configuration
+   - Cluster settings customization
+   - Pipeline state persistence
+
+4. Cloud Integration
+   - AWS S3 support for input/output files
+   - Cloud-aware pipeline decorators
+   - Remote file handling
+
+Example Usage
+-------------
+A basic pipeline using local files:
+
+.. code-block:: python
+
+    from cgatcore import pipeline as P
+
+    # Standard pipeline task
+    @P.transform("input.txt", suffix(".txt"), ".processed")
+    def process_local_file(infile, outfile):
+        # Processing logic here
+        pass
+
+Using S3 integration:
+
+.. code-block:: python
+
+    # S3-aware pipeline task
+    @P.s3_transform("s3://bucket/input.txt", suffix(".txt"), ".processed")
+    def process_s3_file(infile, outfile):
+        # Processing logic here
+        pass
+
+For detailed documentation, see: https://cgat-core.readthedocs.io/
 """
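The docstring's two snippets assume a surrounding pipeline script. For context, here is a minimal sketch of how such a script is usually assembled; the file names and the `pipeline.yml` parameter file are illustrative, while `P.get_parameters`, `P.run` and `P.main` are the conventional cgatcore entry points:

.. code-block:: python

    """pipeline_example.py - a minimal cgatcore pipeline sketch."""

    import sys
    from ruffus import transform, suffix
    from cgatcore import pipeline as P

    # Load parameters from a YAML configuration file, if present
    PARAMS = P.get_parameters(["pipeline.yml"])


    @transform("input.txt", suffix(".txt"), ".processed")
    def process_local_file(infile, outfile):
        # P.run interpolates %(...)s placeholders from the caller's locals
        statement = "tr 'a-z' 'A-Z' < %(infile)s > %(outfile)s"
        P.run(statement)


    if __name__ == "__main__":
        # P.main provides the command-line interface,
        # e.g. `python pipeline_example.py make process_local_file`
        sys.exit(P.main(sys.argv))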

cgatcore/pipeline/cluster.py

Lines changed: 51 additions & 13 deletions

@@ -1,15 +1,54 @@
-'''cluster.py - cluster utility functions for ruffus pipelines
-==============================================================
-
-This module abstracts the DRMAA native specification and provides
-convenience functions for running Drmaa jobs.
-
-Currently SGE, SLURM, Torque and PBSPro are supported.
-
-Reference
----------
-
-'''
+"""
+cluster.py - Cluster job management for CGAT pipelines
+======================================================
+
+This module provides functionality for submitting and managing jobs on various
+cluster platforms (SLURM, SGE, PBS/Torque). It handles:
+
+1. Job Submission
+   - Resource allocation (memory, CPU cores)
+   - Queue selection and prioritization
+   - Job dependencies and scheduling
+
+2. Platform Support
+   - SLURM Workload Manager
+   - Sun Grid Engine (SGE)
+   - PBS/Torque
+   - Local execution (multiprocessing)
+
+3. Resource Management
+   - Memory limits and monitoring
+   - CPU allocation
+   - Job runtime constraints
+   - Temporary directory handling
+
+Configuration
+-------------
+Cluster settings can be configured in `.cgat.yml`:
+
+.. code-block:: yaml
+
+    cluster:
+        queue_manager: slurm
+        queue: main
+        memory_resource: mem
+        memory_default: 4G
+        parallel_environment: dedicated
+
+Available Parameters
+--------------------
+- cluster_queue: Cluster queue to use (default: all.q)
+- cluster_priority: Job priority (-10 to 10, default: -10)
+- cluster_num_jobs: Maximum concurrent jobs (default: 100)
+- cluster_memory_resource: Memory resource identifier
+- cluster_memory_default: Default job memory (default: 4G)
+- cluster_memory_ulimit: Enable memory limits via ulimit
+- cluster_parallel_environment: Parallel environment name
+- cluster_queue_manager: Queue management system
+- cluster_tmpdir: Temporary directory location
+
+For detailed documentation, see: https://cgat-core.readthedocs.io/
+"""

 import re
 import math

@@ -484,7 +523,6 @@ def get_native_specification(self,
        spec.append("-q {}".format(kwargs["queue"]))

        spec.append(kwargs.get("options", ""))
-
        return spec

    def update_template(self, jt):
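The second hunk shows where per-job options such as the queue end up in the DRMAA native specification. At the task level these options are typically supplied through `P.run`; a minimal sketch follows, in which the keyword names mirror the `cluster_*` parameters listed above but should be checked against your cgatcore version:

.. code-block:: python

    from cgatcore import pipeline as P

    def sort_bam(infile, outfile):
        # Resource requests are translated into the queue manager's native
        # specification, e.g. the "-q <queue>" option appended above.
        statement = "samtools sort %(infile)s -o %(outfile)s"
        P.run(statement,
              job_memory="4G",   # per-slot memory, cf. cluster_memory_default
              job_threads=4,     # slots in the parallel environment
              job_queue="main")  # overrides cluster_queue for this job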

cgatcore/pipeline/execution.py

Lines changed: 55 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,15 +1,62 @@
1-
"""execution.py - Job control for ruffus pipelines
2-
=========================================================
1+
"""
2+
execution.py - Task execution for CGAT pipelines
3+
==============================================
4+
5+
This module handles the execution of pipeline tasks, providing support for:
6+
7+
1. Job Execution
8+
- Local execution via subprocess
9+
- Cluster job submission
10+
- Python function execution
11+
- Container-based execution
12+
13+
2. Resource Management
14+
- Memory monitoring and limits
15+
- CPU allocation
16+
- Runtime constraints
17+
- Working directory management
18+
19+
3. Error Handling
20+
- Job failure detection
21+
- Retry mechanisms
22+
- Error logging and reporting
23+
- Clean-up procedures
24+
25+
4. Execution Modes
26+
- Synchronous (blocking) execution
27+
- Asynchronous job submission
28+
- Parallel task execution
29+
- Dependency-aware scheduling
330
4-
Session
5-
-------
31+
Usage Examples
32+
-------------
33+
1. Submit a command to the cluster:
634
7-
This module manages a DRMAA session. :func:`start_session`
8-
starts a session and :func:`close_session` closes it.
35+
.. code-block:: python
36+
37+
statement = "samtools sort input.bam -o output.bam"
38+
job_options = "-l mem_free=4G"
39+
job_threads = 4
40+
41+
execution.run(statement,
42+
job_options=job_options,
43+
job_threads=job_threads)
44+
45+
2. Execute a Python function:
46+
47+
.. code-block:: python
48+
49+
def process_data(infile, outfile):
50+
# Processing logic here
51+
pass
952
10-
Reference
11-
---------
53+
execution.submit(module="my_module",
54+
function="process_data",
55+
infiles="input.txt",
56+
outfiles="output.txt",
57+
job_memory="4G")
1258
59+
For detailed documentation, see: https://cgat-core.readthedocs.io/
1360
"""
1461

1562
import collections
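The new docstring's examples call `execution.run` and `execution.submit` directly; in pipeline code the same functions are normally reached via the `cgatcore.pipeline` namespace. A hedged sketch of the failure path described under "Error Handling" (the broad `except` is deliberate, since the exception class raised for a failed job varies across cgatcore versions):

.. code-block:: python

    import logging
    from cgatcore import pipeline as P

    L = logging.getLogger(__name__)

    def sort_with_logging(infile, outfile):
        statement = "samtools sort %(infile)s -o %(outfile)s"
        try:
            # P.run blocks until the job finishes and raises
            # on a non-zero exit status
            P.run(statement, job_memory="4G")
        except Exception as exc:
            L.error("job failed for %s: %s", infile, exc)
            raise  # let the pipeline's own error handling take over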

docs/function_doc/pipeline.md

Lines changed: 124 additions & 0 deletions

@@ -1,5 +1,129 @@
 # CGATcore Pipeline Module

+The `pipeline` module is the core component of CGAT-core, providing essential functionality for building and executing computational pipelines.
+
+## Core Functions
+
+### Pipeline Decorators
+
+```python
+@transform(input_files, suffix(".input"), ".output")
+def task_function(infile, outfile):
+    """Transform a single input file to an output file."""
+    pass
+
+@merge(input_files, "output.txt")
+def merge_task(infiles, outfile):
+    """Merge multiple input files into a single output."""
+    pass
+
+@split(input_file, "*.split")
+def split_task(infile, outfiles):
+    """Split a single input file into multiple outputs."""
+    pass
+
+@follows(previous_task)
+def dependent_task():
+    """Execute after previous_task completes."""
+    pass
+```
+
+### S3-Aware Decorators
+
+```python
+@s3_transform("s3://bucket/input.txt", suffix(".txt"), ".processed")
+def process_s3_file(infile, outfile):
+    """Process files directly from S3."""
+    pass
+
+@s3_merge(["s3://bucket/*.txt"], "s3://bucket/merged.txt")
+def merge_s3_files(infiles, outfile):
+    """Merge multiple S3 files."""
+    pass
+```
+
+## Configuration Functions
+
+### Pipeline Setup
+```python
+# Initialize pipeline
+pipeline.initialize(options)
+
+# Get pipeline parameters
+params = pipeline.get_params()
+
+# Configure cluster execution
+pipeline.setup_cluster()
+```
+
+### Resource Management
+```python
+# Set memory requirements
+pipeline.set_job_memory("4G")
+
+# Set CPU requirements
+pipeline.set_job_threads(4)
+
+# Configure temporary directory
+pipeline.set_tmpdir("/path/to/tmp")
+```
+
+## Execution Functions
+
+### Running Tasks
+```python
+# Execute a command
+pipeline.run("samtools sort input.bam")
+
+# Submit a Python function
+pipeline.submit(
+    module="my_module",
+    function="process_data",
+    infiles="input.txt",
+    outfiles="output.txt"
+)
+```
+
+### Job Control
+```python
+# Check job status
+pipeline.is_running(job_id)
+
+# Wait for job completion
+pipeline.wait_for_jobs()
+
+# Clean up temporary files
+pipeline.cleanup()
+```
+
+## Error Handling
+
+```python
+try:
+    pipeline.run("risky_command")
+except pipeline.PipelineError as e:
+    pipeline.handle_error(e)
+```
+
+## Best Practices
+
+1. **Resource Management**
+   - Always specify memory and CPU requirements
+   - Use appropriate cluster queue settings
+   - Clean up temporary files
+
+2. **Error Handling**
+   - Implement proper error checking
+   - Use pipeline.log for logging
+   - Handle temporary file cleanup
+
+3. **Performance**
+   - Use appropriate chunk sizes for parallel processing
+   - Monitor resource usage
+   - Optimize cluster settings
+
+For more details, see the [Pipeline Overview](../pipeline_modules/overview.md) and [Writing Workflows](../defining_workflow/writing_workflows.md) guides.
+
 ::: cgatcore.pipeline
     :members:
     :show-inheritance:
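The "Best Practices" list in the new page stays abstract. As a closing illustration, here is a hedged sketch of a single task that follows all three points (explicit resource limits, cleanup on failure, staging through a temporary file); it assumes only `P.run` and `P.get_temp_filename`, which cgatcore provides, and the surrounding structure is illustrative rather than the documented API above:

.. code-block:: python

    import os
    from cgatcore import pipeline as P

    def sort_and_index(infile, outfile):
        # Performance: sort into a temp file, then move atomically
        tmpfile = P.get_temp_filename(".")
        statement = (
            "samtools sort -@ 4 %(infile)s -o %(tmpfile)s && "
            "mv %(tmpfile)s %(outfile)s && "
            "samtools index %(outfile)s"
        )
        try:
            # Resource management: state memory and threads explicitly
            P.run(statement, job_memory="4G", job_threads=4)
        finally:
            # Error handling: remove the temporary file even on failure
            if os.path.exists(tmpfile):
                os.unlink(tmpfile)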
