Construct cellarr TileDB in HPC environments (#61)
Tested on local HPC environments; updated the documentation, changelog, and README.
jkanche authored Dec 19, 2024
1 parent d679fa5 commit 2275dfd
Showing 14 changed files with 802 additions and 4 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
# Changelog

## Version 0.5.0

- Construct cellarr TileDB files on HPC environments based on slurm
(reference: [#61](https://github.com/BiocPy/cellarr/pull/61))

## Version 0.4.0

- chore: Remove Python 3.8 (EOL).
48 changes: 47 additions & 1 deletion README.md
@@ -12,7 +12,7 @@ datasets but can be generalized to store any 2-dimensional experimental data.

To get started, install the package from [PyPI](https://pypi.org/project/cellarr/)

```bash
```sh
pip install cellarr

## to include optional dependencies
@@ -123,6 +123,52 @@ print(dataset)

Check out the [documentation](https://biocpy.github.io/cellarr/tutorial.html) for more details.

### Building on HPC environments with `slurm`

To build TileDB files on HPC environments that use `slurm`, follow these steps.

- Step 1: Construct a manifest file
A minimal manifest file (JSON) must contain the following fields:
- `"files"`: A list of file paths to the input `h5ad` objects.
- `"python_env"`: A set of commands to activate the Python environment containing this package and its dependencies.

Here’s an example that builds the manifest in Python and writes it to disk:

```py
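import json

# Placeholder paths; replace them with your own `h5ad` files.
manifest = {
    "files": ["your/path/to/dataset1.h5ad", "your/path/to/dataset2.h5ad"],
    # Shell commands that activate the environment on each job.
    "python_env": """
ml Miniforge3
conda activate cellarr
python --version
which python
""",
    "matrix_options": [
        {
            "matrix_name": "non_zero_cells",
            "dtype": "uint32"
        },
        {
            "matrix_name": "pseudo_bulk_log_normed",
            "dtype": "float32"
        }
    ],
}

# Write the manifest to disk; the context manager closes the file.
with open("your/path/to/manifest.json", "w") as f:
    json.dump(manifest, f)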
```

For more options, check out the [README](./src/cellarr/slurm/README.md).
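
Before submitting, it can help to confirm that the manifest contains the two required fields. The sketch below is a hypothetical helper (`check_manifest` is not part of `cellarr`); it only checks what the text above states, namely that `"files"` and `"python_env"` are present and that `"files"` is a list of paths.

```python
# Field names come from the README text; the helper itself is hypothetical.
REQUIRED_FIELDS = ("files", "python_env")


def check_manifest(manifest: dict) -> list:
    """Return a list of problems found in a manifest dict (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in manifest:
            problems.append(f"missing required field: {field!r}")
    files = manifest.get("files", [])
    if not isinstance(files, list) or not all(isinstance(f, str) for f in files):
        problems.append("'files' must be a list of path strings")
    return problems


# Placeholder manifest for illustration only.
example = {
    "files": ["path/to/dataset1.h5ad", "path/to/dataset2.h5ad"],
    "python_env": "ml Miniforge3\nconda activate cellarr\n",
}
print(check_manifest(example))
print(check_manifest({"files": "not-a-list"}))
```

An empty list means the minimal requirements are met; it says nothing about whether the paths exist or whether the optional sections are well formed.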

- Step 2: Submit the job
Once your manifest file is ready, you can submit the necessary jobs using the `cellarr_build` CLI. Run the following command:

```sh
cellarr_build --input-manifest your/path/to/manifest.json --output-dir your/path/to/output --memory-per-job 8 --cpus-per-task 2
```
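
If the submission is driven from a Python script rather than typed by hand, the same command can be assembled as an argument list. This is a sketch only: `build_command` is a hypothetical wrapper, and it uses just the four flags shown in the command above with the same example values.

```python
import shlex


def build_command(manifest_path, output_dir, memory_per_job=8, cpus_per_task=2):
    """Assemble the cellarr_build invocation as an argument list."""
    return [
        "cellarr_build",
        "--input-manifest", str(manifest_path),
        "--output-dir", str(output_dir),
        "--memory-per-job", str(memory_per_job),
        "--cpus-per-task", str(cpus_per_task),
    ]


cmd = build_command("your/path/to/manifest.json", "your/path/to/output")
# Print the shell-quoted command for inspection before running it.
print(shlex.join(cmd))
# To submit for real, one could call: subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell-quoting problems when paths contain spaces.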

### Query a `CellArrDataset`

Users can either reuse the `dataset` object returned when building the dataset or create a `CellArrDataset` object by pointing it at the path where the files were created.
4 changes: 2 additions & 2 deletions setup.cfg
@@ -80,8 +80,8 @@ testing =

[options.entry_points]
# Add here console scripts like:
# console_scripts =
# script_name = cellarr.module:function
console_scripts =
cellarr_build = cellarr.slurm.build_cellarr_steps:main
# For example:
# console_scripts =
# fibonacci = cellarr.skeleton:run
2 changes: 1 addition & 1 deletion src/cellarr/buildutils_tiledb_frame.py
@@ -76,7 +76,7 @@ def create_tiledb_frame_from_column_names(
)


def create_tiledb_frame_from_dataframe(tiledb_uri_path: str, frame: List[str], column_types=dict):
def create_tiledb_frame_from_dataframe(tiledb_uri_path: str, frame: List[str], column_types: dict = None):
"""Create a TileDB file with the provided attributes to persistent storage.
This will materialize the array directory and all
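
The signature change in this hunk replaces `column_types=dict` with `column_types: dict = None`. The first form makes the `dict` class itself the default value, which is rarely intended. A minimal illustration follows, using two hypothetical stand-in functions rather than the real `cellarr` function (its body is not shown in this diff, so the `None` handling below is an assumption):

```python
def old_style(column_types=dict):
    # The default here is the `dict` class object, not an empty dict.
    return column_types


def new_style(column_types: dict = None):
    # The default is None; fall back to an empty mapping when it is omitted.
    if column_types is None:
        column_types = {}
    return column_types


print(old_style() is dict)
print(new_style())
```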
65 changes: 65 additions & 0 deletions src/cellarr/slurm/README.md
@@ -0,0 +1,65 @@

# manifest json

```json
{
"files": [
"/path/to/dataset1.h5ad",
"/path/to/dataset2.h5ad"
],
"matrix_options": [
{
"matrix_name": "counts",
"dtype": "uint32"
},
{
"matrix_name": "normalized",
"dtype": "float32"
}
],
"gene_options": {
"feature_column": "index"
},
"sample_options": {
"metadata": {
"sample_1": {
"condition": "control",
"batch": "1"
},
"sample_2": {
"condition": "treatment",
"batch": "1"
}
}
},
"cell_options": {
"column_types": {
"cell_type": "ascii",
"quality_score": "float32"
}
},
"python_env": ". /system/gredit/clientos/etc/profile\n\nml Miniforge3\nconda activate biocpy_miniforge\n\n~/.conda/envs/biocpy_miniforge/bin/python --version\nwhich python\npython --version\n"
}
```
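
JSON has no triple-quoted strings, so a multi-line `python_env` value has to be stored as a single string with `\n` escapes. One way to avoid escaping by hand is to generate the manifest from Python. A minimal sketch, using a shortened `python_env` and placeholder paths for illustration:

```python
import json

# Shortened, illustrative activation commands.
python_env = """
ml Miniforge3
conda activate biocpy_miniforge
python --version
"""

manifest = {
    "files": ["/path/to/dataset1.h5ad", "/path/to/dataset2.h5ad"],
    "python_env": python_env,
}

# json.dumps escapes the newlines so the result is valid JSON.
text = json.dumps(manifest, indent=2)
print(text)

# Round trip: the loaded value matches the original multi-line string.
assert json.loads(text)["python_env"] == python_env
```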


Run the build script:

```sh

python build_cellarr_steps.py \
--input-manifest manifest.json \
--output-dir /path/to/output \
--memory-per-job 64 \
--cpus-per-task 4

```
Empty file added src/cellarr/slurm/__init__.py
Empty file.