Prostate Cancer Spatial Transcriptomics Pipeline

This repository runs a Xenium-based spatial transcriptomics pipeline. The workflow has two required steps:

1. Create or choose a run configuration YAML.
2. Run the pipeline with uv run main.py --config <path-to-config>

The pipeline ingests raw Xenium output, then runs preprocessing, clustering, marker analysis, LLM-based annotation, spatial domain analysis, and colocalization.

Prerequisites

Install uv by following Astral's official uv installation guide. uv is a fast Python package and environment manager; in this repository, uv run ... creates or reuses the project environment and runs the scripts with the locked dependencies.

Install Ollama by following the official Ollama quickstart or download page. The pipeline uses a local Ollama-compatible server for annotation, so Ollama must be installed locally and running before you start main.py.

Required model and server

The annotation stage requires the llama3.1:8b model.

Pull the model once:

ollama pull llama3.1:8b

Start the local Ollama server in a separate terminal before running the pipeline:

ollama serve

The pipeline expects the Ollama API at http://localhost:11434, which is the default server address used by src/annotation.py.
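
Before a long run it can help to confirm that the server is reachable and the model is present. The sketch below is not part of the repository: the function names are hypothetical, and it assumes only Ollama's /api/tags endpoint, which lists locally available models.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address named in this README
REQUIRED_MODEL = "llama3.1:8b"


def model_is_listed(tags_payload: dict, model: str = REQUIRED_MODEL) -> bool:
    """Return True if `model` appears in an /api/tags response body."""
    names = {entry.get("name") for entry in tags_payload.get("models", [])}
    return model in names


def check_ollama(base_url: str = OLLAMA_URL, model: str = REQUIRED_MODEL) -> str:
    """Return 'ok', 'model-missing', or 'server-unreachable' (hypothetical helper)."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return "server-unreachable"
    return "ok" if model_is_listed(payload, model) else "model-missing"
```

Calling check_ollama() from a Python shell returns "server-unreachable" when ollama serve is not running and "model-missing" when the model has not been pulled.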

Configure a run

Use config.example.yaml as the schema and reference config, then create a named run config outside pipeline/, such as ../configs/bone.yaml. The config defines these keys:

samples: list of sample IDs and raw Xenium output directories
output_directory: path where processed data, analysis artifacts, figures, and logs should be written
annotation_model: LLM model name, defaulting to llama3.1:8b
pipeline: numeric analysis parameters

Relative paths inside a config file are resolved relative to that config file's directory.
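
The resolution rule can be sketched as follows. This is an illustration of the stated behavior rather than the repository's code: resolve_config_path is a hypothetical name, and collapsing .. segments with resolve() is an assumption.

```python
from pathlib import Path


def resolve_config_path(config_file: str, value: str) -> Path:
    """Resolve `value` relative to the directory that contains `config_file`."""
    path = Path(value)
    if path.is_absolute():
        # Absolute paths are used as written.
        return path
    # Relative paths are anchored at the config file's directory,
    # not at the current working directory.
    return (Path(config_file).resolve().parent / path).resolve()
```

Under this rule, an output_directory of ../results in ../configs/bone.yaml points at a results/ directory next to configs/, regardless of where uv run is invoked.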

Run the pipeline end-to-end

From the pipeline/ directory, run all stages:

uv run main.py --config ../configs/bone.yaml

Run selected stages

Stages always execute in pipeline order, even if they are provided out of order:

uv run main.py --config ../configs/bone.yaml --stage ingest preprocess
uv run main.py --config ../configs/bone.yaml --stage annotate domains colocalization

Available stages are:

ingest
preprocess
annotate
domains
colocalization
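
The ordering rule amounts to filtering the fixed stage list by the requested names. A minimal sketch, assuming that duplicates collapse and that unknown names are rejected (main.py may handle both differently); order_stages is a hypothetical name:

```python
PIPELINE_ORDER = ["ingest", "preprocess", "annotate", "domains", "colocalization"]


def order_stages(requested: list[str]) -> list[str]:
    """Return the requested stages in pipeline order, ignoring input order."""
    unknown = [stage for stage in requested if stage not in PIPELINE_ORDER]
    if unknown:
        raise ValueError(f"unknown stage(s): {', '.join(unknown)}")
    wanted = set(requested)
    # Walking the canonical list preserves pipeline order.
    return [stage for stage in PIPELINE_ORDER if stage in wanted]
```

For example, order_stages(["colocalization", "annotate", "domains"]) yields annotate, domains, colocalization.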

Stage 0: ingest raw data

The ingest stage reads the raw Xenium samples listed under samples in the run config and writes the merged AnnData object:

processed/processed.h5ad

Stage 1: preprocess and cluster

The preprocess stage reads processed/processed.h5ad, computes QC metrics, filters cells/genes, normalizes and scales expression, runs PCA, optionally applies Harmony across sample_id, builds neighbors, computes UMAP and Leiden clusters, ranks marker genes, and writes cluster labels/enriched marker lists.

For multi-sample runs, the production pipeline stores the standard pre/post-Harmony sample-level diagnostic through analysis.run_clustering and plotting.plot_harmony_diagnostic. More exploratory Harmony diagnostics, such as core-specific before/after UMAPs, belong in notebooks/ rather than src/ unless they are used by main.py or shared production code.

Stage 2: annotate clusters

The annotation stage sends per-cluster enriched gene lists to the local Ollama model and writes:

analysis/cluster_annotations.json
figures/umap_leiden.png
per-sample figures/cell_type_overlays/*.png

It also maps the returned labels onto obs["cell_type"] in processed/processed.h5ad.
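
The two halves of this stage, building a request for the model and mapping its answers back onto cells, can be sketched as below. The prompt wording, function names, gene list, and the "Unknown" fallback are illustrative assumptions; only the model, prompt, and stream fields of Ollama's /api/generate request are taken from the Ollama API.

```python
def build_generate_request(cluster_id: str, genes: list[str],
                           model: str = "llama3.1:8b") -> dict:
    """Build a JSON body for POST /api/generate (the prompt text is hypothetical)."""
    prompt = (
        f"Cluster {cluster_id} of a prostate cancer Xenium dataset is enriched for: "
        f"{', '.join(genes)}. Reply with the single most likely cell type."
    )
    # stream=False asks Ollama for one complete JSON response.
    return {"model": model, "prompt": prompt, "stream": False}


def apply_cluster_labels(leiden: list[str], annotations: dict[str, str],
                         default: str = "Unknown") -> list[str]:
    """Map each cell's cluster ID to its annotated cell type."""
    return [annotations.get(cluster, default) for cluster in leiden]


# Illustrative values, not pipeline output.
request_body = build_generate_request("3", ["KLK3", "AR", "NKX3-1"])
cell_types = apply_cluster_labels(["0", "1", "0"], {"0": "Luminal epithelial", "1": "Fibroblast"})
```

In an AnnData setting, the result of apply_cluster_labels would correspond to the values written to obs["cell_type"].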

Stage 3: spatial domains

The domain stage computes local neighborhood composition, clusters those vectors into spatial domains, asks the LLM for microenvironment-style domain labels, and writes:

analysis/spatial_domain_annotations.json
per-sample analysis/<sample_id>/spatial_domain_labels.csv
per-sample figures/spatial_domain_overlays/*.png
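
As an illustration of local neighborhood composition, the sketch below computes, for each cell, the fraction of each cell type among its k nearest neighbors. It assumes a k-nearest-neighbor definition and brute-force distances; the pipeline may instead use a radius or a precomputed graph, and the clustering of these vectors into domains is not shown.

```python
import math
from collections import Counter


def neighborhood_composition(coords, cell_types, k=3):
    """Return (categories, vectors): per-cell type fractions among k nearest neighbors."""
    categories = sorted(set(cell_types))
    vectors = []
    for i, (xi, yi) in enumerate(coords):
        # Brute-force distances to every other cell: O(n^2), fine for a sketch.
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        if not neighbors:
            # A lone cell has no neighborhood to describe.
            vectors.append([0.0] * len(categories))
            continue
        counts = Counter(cell_types[j] for j in neighbors)
        vectors.append([counts[c] / len(neighbors) for c in categories])
    return categories, vectors
```

Each row sums to 1 when the cell has at least one neighbor, so cells in similar microenvironments end up with similar vectors.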

Stage 4: colocalization

The colocalization stage computes observed cell-type contact matrices and permutation-based enrichment/depletion statistics while keeping coordinates and the neighbor graph fixed. It writes heatmaps for raw contact counts, row-normalized contact proportions, log2 fold enrichment, and significant-only fold enrichment.
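
The permutation idea can be sketched as follows: the edge list stays fixed, and only the cell-type labels are shuffled to build a null distribution. The function names, the +1 pseudocount in the fold change, and the one-sided empirical p-values are assumptions for illustration; the pipeline's exact statistics may differ.

```python
import math
import random
from collections import Counter
from itertools import combinations_with_replacement


def contact_counts(edges, labels):
    """Count contacts per unordered pair of cell types over a fixed edge list."""
    counts = Counter()
    for i, j in edges:
        counts[tuple(sorted((labels[i], labels[j])))] += 1
    return counts


def permutation_enrichment(edges, labels, n_perm=200, seed=0):
    """Compare observed contacts with label-shuffled contacts on the same graph."""
    rng = random.Random(seed)
    observed = contact_counts(edges, labels)
    pairs = list(combinations_with_replacement(sorted(set(labels)), 2))
    null_sum, at_least, at_most = Counter(), Counter(), Counter()
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # labels move; coordinates and edges do not
        perm = contact_counts(edges, shuffled)
        for pair in pairs:
            null_sum[pair] += perm[pair]
            at_least[pair] += int(perm[pair] >= observed[pair])
            at_most[pair] += int(perm[pair] <= observed[pair])
    results = {}
    for pair in pairs:
        expected = null_sum[pair] / n_perm
        results[pair] = {
            "observed": observed[pair],
            "expected": expected,
            # Pseudocount of 1 avoids taking the log of zero.
            "log2_fold": math.log2((observed[pair] + 1) / (expected + 1)),
            "p_enriched": (at_least[pair] + 1) / (n_perm + 1),
            "p_depleted": (at_most[pair] + 1) / (n_perm + 1),
        }
    return results
```

A positive log2_fold with a small p_enriched suggests two cell types touch more often than label shuffling predicts; a negative value with a small p_depleted suggests avoidance.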

Output layout

Under output_directory, the pipeline creates:

processed/: persistent processed AnnData, especially processed/processed.h5ad
analysis/: JSON/CSV analysis artifacts
figures/: saved plots
logs/: run logs

Notebooks

The notebooks/ directory contains exploratory or validation analyses that are not part of the production main.py stage graph. Examples include IHC/Xenium cell-type concordance analysis and core-specific Harmony diagnostics. Notebook-only helpers should stay local to notebooks unless they become reusable production logic.

Common failure modes

If a downstream stage fails because processed/processed.h5ad is missing, run the ingest stage first.
If annotation fails, confirm that ollama serve is running and that llama3.1:8b was downloaded with ollama pull llama3.1:8b.
If a stage complains about a missing obs column such as leiden or cell_type, rerun the required upstream stage first.
If paths are wrong, update the run config passed to --config before rerunning.
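
Several of these failures can be caught before a stage starts. The sketch below is a hypothetical preflight check, not code from the repository; in particular, the mapping from stage to required obs columns is an assumption inferred from the stage descriptions above.

```python
from pathlib import Path

# Assumed mapping from stage to the obs columns it needs; not taken from main.py.
REQUIRED_OBS = {
    "annotate": ["leiden"],
    "domains": ["cell_type"],
    "colocalization": ["cell_type"],
}


def preflight(stage, output_directory, obs_columns):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    h5ad = Path(output_directory) / "processed" / "processed.h5ad"
    if stage != "ingest" and not h5ad.exists():
        problems.append(f"{h5ad} is missing: run the ingest stage first")
    for column in REQUIRED_OBS.get(stage, []):
        if column not in obs_columns:
            problems.append(
                f"obs column '{column}' is missing: rerun the upstream stage that writes it"
            )
    return problems
```

Here obs_columns would be the column names of the AnnData obs table, for example list(adata.obs.columns) once the file has been loaded.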
