
Add SLURM cluster support (cellpose 3.x backport) #1444

Open

karimi-ali wants to merge 2 commits into MouseLand:main from karimi-ali:slurm-distributed-v3

Conversation

@karimi-ali

Add SLURM cluster support to distributed_segmentation (3.x backport)

Companion to the 4.x PR — see that one for the full description and
test context. This PR is the same slurmCluster patch applied to
the cellpose 3.x maintenance line.

Why a separate PR?

A user with a cellpose 3.x cyto-style model (CP_20250324_Nuc6) hit
the same SLURM-cluster pain point as the 4.x users (issue #1111).
The patch is intentionally cellpose-version-agnostic — it only
touches cellpose/contrib/distributed_segmentation.py and depends
on dask_jobqueue.SLURMCluster, not on cellpose internals — so the
same file works on both 3.x and 4.x. Making this a separate
maintenance-branch PR keeps the 3.x line usable without forcing a
4.x upgrade for users with legacy models.

Files

  • cellpose/contrib/distributed_segmentation.py — same patch as the
    4.x PR.

What's in the patch (full description in the 4.x PR body):

  • slurmCluster class + cluster_type dispatch (see the sketch after
    this list)
  • gpus_per_job multi-GPU mode (with dask-cuda-worker shim)
  • resume_dir for walltime-recoverable runs
  • Memory-format split (dask "MB" string vs SLURM --mem integer)
  • SLURM-aware "release GPUs for stitching" branch
  • change_worker_attributes reliability fix
    (scale(0) + sync(_correct_state), _job_kwargs direct update,
    job-script preview log)
  • merge_all_boxes vectorized via argsort + reduceat
    (was an O(N) scan per label, quadratic overall → O(N log N · ndim);
    fixed a stitching wedge on volumes with ~10^7 unique labels).
    Correctness verified on synthetic inputs (N=5e3 / N=5e4,
    0 mismatches against the legacy implementation) and end-to-end on
    an 8-block 512³ subvolume (51,356 final merged cells; same caveat
    about the upstream to_zarr step being slow as in the 4.x PR).
  • overlap = int(diameter * 2) and face.ndim instead of hardcoded
    3D structuring element
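
For concreteness, here is a minimal sketch of the cluster_type
dispatch and the memory-format split. Only the dependency on
dask_jobqueue.SLURMCluster is confirmed by this PR; the function
shape, parameter names, and the job_extra_directives usage below are
illustrative assumptions, not the patch's exact API.

```python
# Hedged sketch, not the patch itself: illustrates the cluster_type
# dispatch and the dask-vs-SLURM memory split described above.
# Parameter names and the wrapper shape are assumptions.
from dask_jobqueue import SLURMCluster

def make_cluster(cluster_type, n_workers=4, memory_gb=16, cores=4, **kwargs):
    if cluster_type == "slurm":
        # dask wants a memory string with units; SLURM's --mem wants a
        # plain integer (MB), so the same quantity is rendered twice.
        dask_memory = f"{memory_gb * 1000}MB"   # e.g. "16000MB" for dask
        slurm_mem = int(memory_gb * 1000)       # e.g. 16000 for --mem
        cluster = SLURMCluster(
            cores=cores,
            memory=dask_memory,
            job_extra_directives=[f"--mem={slurm_mem}"],
            **kwargs,
        )
        cluster.scale(n_workers)
        return cluster
    raise ValueError(f"unknown cluster_type: {cluster_type}")
```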

No other files changed. Docs in docs/distributed.rst updated only
in the 4.x PR (3.x docs may differ slightly; happy to mirror if the
maintainer wants).

Tested

End-to-end on MPCDF Raven cluster with cellpose 3.1.1.2 + the
CP_20250324_Nuc6 custom nuclei model on a 4518×5008×4560 uint16
X-ray volume.

karimi-ali and others added 2 commits May 1, 2026 06:41
Add SLURM cluster support to distributed_segmentation (3.x backport)

Mirror of the cellpose 4.x patch — same `slurmCluster` class, same
`cluster_type` dispatch, same `resume_dir` mechanism, same memory-format
split for dask vs SLURM, same bug fixes.

The `slurmCluster` patch is intentionally cellpose-version-agnostic:
it depends only on `dask_jobqueue.SLURMCluster`, not on cellpose
internals. So the same `cellpose/contrib/distributed_segmentation.py`
file works under both 3.x and 4.x once dropped over a 3.x install.

Tested end-to-end on the MPCDF Raven cluster against a custom cellpose
3.x model (`CP_20250324_Nuc6`) on a 4518x5008x4560 uint16 X-ray nuclei
volume.

See companion PR against `main` (cellpose 4.x) for the full description.
(3.x backport — identical to the 4.x patch on slurm-distributed.)

change_worker_attributes
========================
The previous implementation patched ``self.new_spec['options'][k] = v``
and called adapt(). On the GLC-07391_2 production run we observed via
scontrol that newly spawned stitching jobs still carried the original
GPU directives (cpu=18, mem=125000M, gres=gpu:a100:1) — the kwargs
never made it onto the queued jobs.

Two changes to make this reliable:

* ``self.scale(0)`` is followed by ``self.sync(self._correct_state)``
  so the cluster blocks until the existing GPU workers have actually
  left. Without this, adapt() can find a worker still in the spec and
  skip the respawn, leaving the run stuck against the original SLURM
  directives.
* The kwargs are written into ``self._job_kwargs`` (the canonical
  store dask-jobqueue uses to render the job script) rather than just
  ``self.new_spec['options']``. We assert the two are still the same
  dict; if a future dask-jobqueue version breaks that invariant, the
  failure is loud rather than silent.

The function now also prints the freshly-rendered SBATCH header for
the next worker so the directives are visible in the driver log.
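
A minimal sketch of the resulting flow, assuming a
``dask_jobqueue.SLURMCluster`` instance; ``scale``, ``sync``,
``_correct_state``, ``_job_kwargs``, ``new_spec``, and ``job_script``
are real dask-jobqueue/distributed attributes, while the function
shape is illustrative:

```python
# Hedged sketch of the change_worker_attributes fix described above.
import logging

logger = logging.getLogger(__name__)

def change_worker_attributes(cluster, **new_job_kwargs):
    """Retire current workers, then respawn with new SLURM directives."""
    # Block until the existing GPU workers have actually left;
    # otherwise adapt() may find a stale worker in the spec and skip
    # the respawn.
    cluster.scale(0)
    cluster.sync(cluster._correct_state)

    # dask-jobqueue renders job scripts from _job_kwargs;
    # new_spec['options'] is expected to alias the same dict, so
    # assert the invariant to fail loudly if a future version breaks it.
    assert cluster._job_kwargs is cluster.new_spec["options"]
    cluster._job_kwargs.update(new_job_kwargs)

    # Surface the freshly rendered SBATCH header in the driver log.
    logger.info("next worker job script:\n%s", cluster.job_script())
```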

merge_all_boxes
===============
Was O(N) per unique id (per-id ``argwhere(==iii)``). For volumes with
~10^7 unique labels after stitching the quadratic blow-up wedged the
final box-merge for hours. Replaced with a single ``argsort`` plus
``np.minimum/maximum.reduceat`` over (N, ndim) start/stop arrays —
O(N log N * ndim).
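
A self-contained sketch of that vectorization; the
``(labels, starts, stops)`` array layout is an assumption about the
contrib code's internal data, not its exact signature:

```python
import numpy as np

def merge_all_boxes(labels, starts, stops):
    """Merge bounding boxes that share a label.

    labels: (N,) int array of box labels
    starts, stops: (N, ndim) box corners
    Returns unique labels and merged (start, stop) corners in
    O(N log N) instead of one O(N) scan per unique label.
    """
    order = np.argsort(labels, kind="stable")
    sorted_labels = labels[order]
    # Index where each run of equal labels begins.
    group_starts = np.flatnonzero(
        np.r_[True, sorted_labels[1:] != sorted_labels[:-1]]
    )
    merged_start = np.minimum.reduceat(starts[order], group_starts, axis=0)
    merged_stop = np.maximum.reduceat(stops[order], group_starts, axis=0)
    return sorted_labels[group_starts], merged_start, merged_stop
```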

Verified bit-for-bit against the legacy implementation on synthetic
inputs of (N=5e3, M=8e2) and (N=5e4, M=5e3); 0 mismatches in both
regimes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
