Proof-of-concept: Cellpose distributed on Slurm cluster with AMD GPUs #1334
Open
erjel wants to merge 15 commits into MouseLand:main from
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             main    #1334      +/-   ##
==========================================
+ Coverage   42.19%   42.29%   +0.09%
==========================================
  Files          16       16
  Lines        3773     3783      +10
==========================================
+ Hits         1592     1600       +8
- Misses       2181     2183       +2
In the current form, Looking forward to feedback!
Hi,
for a project of mine I needed to scale Cellpose on a Slurm cluster. To make things a little more interesting, the cluster I have at hand has only AMD GPUs. The documentation on distributed Cellpose gives hints on how to run on LSF clusters. I also want to mention that there is already some documentation on how to run Cellpose on AMD GPUs.
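Since the cluster only has AMD GPUs, a quick sanity check of the PyTorch build can save some debugging time. The following is a minimal sketch, not part of this PR: the helper name `describe_gpu_backend` is hypothetical, and it only assumes that ROCm builds of PyTorch set `torch.version.hip` and expose the GPU through the `torch.cuda` namespace.

```python
def describe_gpu_backend():
    """Report which GPU backend the installed PyTorch was built for.

    Hypothetical helper for illustration; it is not part of Cellpose.
    """
    info = {
        "torch_installed": False,
        "backend": None,
        "gpu_available": False,
        "device_name": None,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is missing from this environment; nothing more to report.
        return info

    info["torch_installed"] = True
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip = getattr(torch.version, "hip", None)
    cuda = getattr(torch.version, "cuda", None)
    info["backend"] = "rocm" if hip else ("cuda" if cuda else "cpu")
    # ROCm builds reuse the torch.cuda API, so the same call covers AMD GPUs.
    info["gpu_available"] = bool(torch.cuda.is_available())
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    print(describe_gpu_backend())
```

Running this inside a batch job on a GPU node (rather than on the login node) shows what the workers will actually see.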
The first contribution of this PR is a working conda environment file (`environment-rocm.yaml`) for inference on AMD GPUs. I am happy to update the install documentation accordingly.
The second contribution is a medium-sized test case for a Slurm cluster (`cellpose/contrib/test_slurm.py`, `cellpose/contrib/cluster_script.py`). The example data is not special by any means, and it does not work particularly well with Cellpose-SAM. If someone knows of a nice dataset (1024 x 1024 x 1024 px) that is worth highlighting in the distributed Cellpose documentation, I am open to suggestions. My hope is that the test can serve as a reference for checking distributed Cellpose on different clusters before users try to run Cellpose on their own data.
Lastly, I modified `cellpose/contrib/distributed_segmentation.py` so that it now works for my circumstances. Note that there are three things left to be done:
1. The code still needs some clean-up after my initial tests with cropping/transposing.
2. In its current form, the PR will break the functionality of the `janeliaLSFCluster` class because `distributed_eval` lacks an abstraction for `mem`, `cores`, and `ncpus`.
3. Scaling the cluster to 0 workers, changing the worker config, and rescaling did not work for me. I am happy to run further tests, but I would need some assistance with Dask debugging.
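To make point 2 more concrete, here is a rough sketch of what such an abstraction could look like. It is not the code in this PR: `WorkerResources`, `lsf_kwargs`, `slurm_kwargs`, and `make_cluster` are hypothetical names, the memory-per-CPU default and the `--gres` directive are assumptions that will differ between sites, and only the keyword names (`cores`, `memory`, `ncpus`, `mem`, `job_extra_directives`) are taken from `dask_jobqueue`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerResources:
    """Scheduler-independent description of one Dask worker (hypothetical)."""
    ncpus: int = 2
    gb_per_cpu: float = 15.0  # assumed default; tune this per cluster
    gpus: int = 1

    @property
    def memory_gb(self) -> float:
        return self.ncpus * self.gb_per_cpu


def lsf_kwargs(res: WorkerResources) -> dict:
    """Keyword arguments for dask_jobqueue.LSFCluster."""
    return {
        "cores": res.ncpus,
        "ncpus": res.ncpus,
        "memory": f"{res.memory_gb:g}GB",
        "mem": int(res.memory_gb * 1e9),  # LSF memory request in bytes
    }


def slurm_kwargs(res: WorkerResources, gres_name: str = "gpu") -> dict:
    """Keyword arguments for dask_jobqueue.SLURMCluster."""
    kwargs = {"cores": res.ncpus, "memory": f"{res.memory_gb:g}GB"}
    if res.gpus > 0:
        # Site-specific assumption: GPUs are requested through --gres.
        # Older dask_jobqueue releases call this argument `job_extra`.
        kwargs["job_extra_directives"] = [f"--gres={gres_name}:{res.gpus}"]
    return kwargs


def make_cluster(scheduler: str, res: WorkerResources, **extra):
    """Create a dask_jobqueue cluster for the given scheduler name."""
    builders = {"lsf": ("LSFCluster", lsf_kwargs),
                "slurm": ("SLURMCluster", slurm_kwargs)}
    if scheduler not in builders:
        raise ValueError(f"unsupported scheduler: {scheduler!r}")
    import dask_jobqueue  # imported lazily so the helpers work without it

    class_name, build = builders[scheduler]
    return getattr(dask_jobqueue, class_name)(**build(res), **extra)
```

With something along these lines, `distributed_eval` would only need to pass a scheduler name and a `WorkerResources` instance, and the LSF-specific `mem`/`ncpus` handling would stay in one place.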
I am happy to polish the code and documentation over the next few days. Since I am not really a Dask expert, I would very much appreciate feedback on my Dask usage.
Best wishes,
Eric
fixes #1111