Add SLURM cluster support (cellpose 3.x backport) #1444
Open
karimi-ali wants to merge 2 commits into
…ackport)

Mirror of the cellpose 4.x patch — same `slurmCluster` class, same `cluster_type` dispatch, same `resume_dir` mechanism, same memory-format split for dask vs SLURM, same bug fixes.

The `slurmCluster` patch is intentionally cellpose-version-agnostic: it depends only on `dask_jobqueue.SLURMCluster`, not on cellpose internals, so the same `cellpose/contrib/distributed_segmentation.py` file works under both 3.x and 4.x once dropped over a 3.x install.

Tested end-to-end on the MPCDF Raven cluster against a custom cellpose 3.x model (`CP_20250324_Nuc6`) on a 4518x5008x4560 uint16 X-ray nuclei volume. See the companion PR against `main` (cellpose 4.x) for the full description.
(3.x backport — identical to the 4.x patch on slurm-distributed.)

change_worker_attributes
========================

The previous implementation patched ``self.new_spec['options'][k] = v`` and called adapt(). On the GLC-07391_2 production run we observed via scontrol that newly spawned stitching jobs still carried the original GPU directives (cpu=18, mem=125000M, gres=gpu:a100:1) — the kwargs never made it onto the queued jobs.

Two changes to make this reliable:

* ``self.scale(0)`` is followed by ``self.sync(self._correct_state)`` so the cluster blocks until the existing GPU workers have actually left. Without this, adapt() can find a worker still in the spec and skip the respawn, leaving the run stuck against the original SLURM directives.
* The kwargs are written into ``self._job_kwargs`` (the canonical store dask-jobqueue uses to render the job script) rather than just ``self.new_spec['options']``. We assert the two are still the same dict; if a future dask-jobqueue version breaks that invariant, the failure is loud rather than silent.

The function now also prints the freshly-rendered SBATCH header for the next worker so the directives are visible in the driver log.

merge_all_boxes
===============

Was O(N) per unique id (per-id ``argwhere(==iii)``). For volumes with ~10^7 unique labels after stitching the quadratic blow-up wedged the final box-merge for hours. Replaced with a single ``argsort`` plus ``np.minimum/maximum.reduceat`` over (N, ndim) start/stop arrays — O(N log N * ndim).

Verified bit-for-bit against the legacy implementation on synthetic inputs of (N=5e3, M=8e2) and (N=5e4, M=5e3); 0 mismatches in both regimes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
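The ``argsort`` + ``reduceat`` box merge described above can be sketched as follows. This is a minimal numpy sketch, not the exact cellpose code: the function name and the (N, ndim) start/stop layout are assumptions taken from the commit message.

```python
import numpy as np

def merge_all_boxes(ids, starts, stops):
    """Merge per-fragment bounding boxes that share a label id.

    ids:    (N,)      integer label per fragment
    starts: (N, ndim) box start corners
    stops:  (N, ndim) box stop corners
    Returns one merged (start, stop) box per unique id, in one
    sort pass instead of one argwhere scan per label.
    """
    order = np.argsort(ids, kind="stable")        # group equal ids together
    ids_sorted = ids[order]
    # Index of the first row of each run of equal ids.
    group_starts = np.flatnonzero(
        np.r_[True, ids_sorted[1:] != ids_sorted[:-1]])
    unique_ids = ids_sorted[group_starts]
    # Per-group elementwise min of starts / max of stops, one pass each.
    merged_start = np.minimum.reduceat(starts[order], group_starts, axis=0)
    merged_stop = np.maximum.reduceat(stops[order], group_starts, axis=0)
    return unique_ids, merged_start, merged_stop
```

The sort dominates at O(N log N); the two `reduceat` calls are linear in N per dimension, which matches the O(N log N · ndim) figure quoted above.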
Add SLURM cluster support to `distributed_segmentation` (3.x backport)

Companion of the 4.x PR — see that one for the full description and test context. This PR is the same `slurmCluster` patch applied to the cellpose 3.x maintenance line.
Why a separate PR?
A user with a cellpose 3.x cyto-style model (`CP_20250324_Nuc6`) hit the same SLURM-cluster pain point as the 4.x users (issue #1111). The patch is intentionally cellpose-version-agnostic — it only touches `cellpose/contrib/distributed_segmentation.py` and depends on `dask_jobqueue.SLURMCluster`, not on cellpose internals — so the same file works on both 3.x and 4.x. Making this a separate maintenance-branch PR keeps the 3.x line usable without forcing a 4.x upgrade for users with legacy models.
Files
`cellpose/contrib/distributed_segmentation.py` — same patch as the 4.x PR.
What's in the patch (full description in the 4.x PR body):
* `slurmCluster` class + `cluster_type` dispatch
* `gpus_per_job` multi-GPU mode (with `dask-cuda-worker` shim)
* `resume_dir` for walltime-recoverable runs
* memory-format split for dask vs SLURM (integer `--mem`)
* `change_worker_attributes` reliability fix (`scale(0)` + `sync(_correct_state)`, direct `_job_kwargs` update, job-script preview log)
* `merge_all_boxes` vectorized via `argsort` + `reduceat` (was O(N) per group → O(N log N · ndim); fixed a stitching wedge on volumes with ~10^7 unique labels). Correctness verified on synthetic inputs (N=5e3 / N=5e4, 0 mismatches against the legacy implementation) and end-to-end on an 8-block 512³ subvolume (51 356 final merged cells; same caveat about the upstream `to_zarr` step being slow as in the 4.x PR).
* `overlap = int(diameter * 2)` and `face.ndim` instead of a hardcoded 3D structuring element
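On the last bullet: a dimensionality-agnostic connectivity-1 structuring element can be built directly from the array's `ndim` instead of hardcoding the 3-D case. The sketch below is illustrative — the function name and plain-numpy construction are assumptions; the patch may equally use `scipy.ndimage.generate_binary_structure(face.ndim, 1)`.

```python
import numpy as np

def cross_structuring_element(ndim):
    """Connectivity-1 (face-adjacent) structuring element for any ndim,
    equivalent to scipy.ndimage.generate_binary_structure(ndim, 1)."""
    shape = (3,) * ndim
    grid = np.indices(shape)              # shape (ndim, 3, 3, ...)
    # Manhattan distance from the center voxel; keep distance <= 1,
    # i.e. the center plus its 2*ndim face neighbors.
    return np.abs(grid - 1).sum(axis=0) <= 1
```

For `ndim=2` this is the familiar 3×3 cross; for `ndim=3` it selects the center plus its six face neighbors, so the same call works for 2-D faces and 3-D blocks alike.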
No other files changed. Docs in `docs/distributed.rst` updated only in the 4.x PR (3.x docs may differ slightly; happy to mirror if the maintainer wants).
Tested
End-to-end on the MPCDF Raven cluster with cellpose 3.1.1.2 + the `CP_20250324_Nuc6` custom nuclei model on a 4518×5008×4560 uint16 X-ray volume.