
Non-adherence to resource parameters #3873


Open · fruce-ki opened this issue Apr 18, 2025 · 11 comments
Labels
concurrency (Related to parallel processing)

Comments

@fruce-ki

It seems that the parameters for multithreading and RAM are not obeyed throughout.

The screenshot of my resource monitor is from a sorting run with `global_job_kwargs = dict(n_jobs=1, mp_context='spawn', chunk_memory='10G', total_memory=None)` as overall settings, and `'n_jobs': 1` and `'memory_limit': 0.5` in the SC2 settings.
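For reference, a minimal sketch of how these settings are applied (the exact script isn't shown here; `set_global_job_kwargs` and `run_sorter` are the relevant spikeinterface entry points, and `recording` is a placeholder for an already-loaded recording):

```python
import spikeinterface.full as si

# Global job kwargs as described above: single worker, spawn context, 10G chunks.
si.set_global_job_kwargs(n_jobs=1, mp_context="spawn", chunk_memory="10G", total_memory=None)

# Sorter-level settings passed to SpykingCircus2 (SC2): one job, memory_limit of 0.5.
sorting = si.run_sorter(
    sorter_name="spykingcircus2",
    recording=recording,  # placeholder for an already-loaded/preprocessed recording
    n_jobs=1,
    memory_limit=0.5,
)
```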

[Screenshot: resource monitor during the sorting run, showing near-full CPU and RAM usage]

On a workstation with 20 physical Intel cores and >300G of RAM, this run should have been barely visible on the resource monitor if the resource restrictions were respected. But apparently some stages of the execution ask for, and receive, 100% of the resources. The screenshot was taken during SC2, but I get similar situations with the preprocessing and postprocessing modules.

I've tried 'memory_limit': 0.1 with not much difference: right now I'm sitting at 80% RAM during find spikes (circus-omp-svd) (no parallelization), when the allowed RAM should be just 10% if I understand correctly.

This is quite a problem for me. On one hand, this often oversubscribes resources and I come back to a killed terminal session, making it impossible to reliably queue processing of many recordings overnight or over the weekend. On the other hand, it also prevents me from running any other resource-intensive tasks while processing a recording, because I know that those resources will at some point become unavailable and cause things to crash.

Am I doing something wrong with my parameters? Is there a way to keep SI from taking up all the resources and instead take up only a predictable amount that would allow me to run other stuff in parallel with this?

I'm on version 0.102.2.

@zm711
Member

zm711 commented Apr 18, 2025

I think part of this problem is that we pass some of the code off to sklearn, scipy, and numpy, and we can't control their resource utilization. In the past we've had issues where parts of our library that rely on those other libraries switched to much higher utilization. Also note that n_jobs is actually the number of workers (processes), so the scheduler in your OS can choose to spread that among many cores or keep it on just one core (for example, on Windows I often request one process but end up using all my cores because Windows will move a single process across all of them). Our RAM limits are a little outside my expertise, sorry; maybe @samuelgarcia or @alejoe91 can comment there.

Which functions do you use? And do any respect the limits? I know SC2 uses a mix of other libraries, but some of our preprocessing only uses numpy, which I think doesn't override our limits the way sklearn and scipy can.
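As a generic workaround (not a spikeinterface feature), the BLAS/OpenMP thread pools that numpy, scipy, and sklearn use can be capped at runtime with threadpoolctl; a sketch, assuming the extra CPU load comes from those pools:

```python
# Generic Python workaround, not something spikeinterface controls: cap the BLAS/OpenMP
# thread pools used by numpy/scipy/sklearn while the heavy calls run.
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

with threadpool_limits(limits=1):
    # Run the preprocessing / sorting calls here; native thread pools are limited
    # to one thread inside this block.
    ...
```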

@zm711 added the concurrency label Apr 18, 2025
@chrishalcrow
Member

Hi @fruce-ki, I've also had some problems with the circus-omp-svd spike-finding method, so maybe there's a memory issue there (note: SC2 isn't published yet, so I'd treat it as an experimental sorter). I had better luck with the "wobble" matching method, which can be selected like so:

```python
sorting = si.run_sorter(
    recording=my_recording,
    sorter_name="spykingcircus2",
    matching={"method": "wobble"},
)
```

I'd be interested to know if this improves the memory use. Let me know :)

@h-mayorquin
Collaborator

I don't know how memory_limit works in SC2, but we might have an understandable misunderstanding about job_kwargs.

The memory-related parameters in job_kwargs (chunk_memory, total_memory, etc.) refer to the size of the chunks: the unit of processing the pipeline works with. They are not hard limits or memory caps; even the postprocessing within our library doesn't strictly respect them. And as others have mentioned, we don't control the algorithms used by other libraries or their memory costs.
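To make the distinction concrete, a rough back-of-the-envelope sketch (illustrative numbers, not the exact spikeinterface internals):

```python
# chunk_memory bounds the size of each traces chunk handed to a worker, not the total
# RAM of the process; whatever the algorithm allocates on top of the chunk (SVD features,
# matching buffers, sklearn/scipy temporaries) is not capped by it.
chunk_memory_bytes = 10 * 1024**3   # '10G', as in the job_kwargs above
num_channels = 1020                 # e.g. a MaxWell MaxTwo well with all electrodes active
bytes_per_sample = 4                # float32 traces after preprocessing

samples_per_chunk = chunk_memory_bytes // (num_channels * bytes_per_sample)
print(samples_per_chunk)            # ~2.6 million samples per chunk, per worker
```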

Maybe we should, but we don't have the resources at the moment to promise such guarantees.

@fruce-ki
Author

Thank you all for the responses.

I have no hands-on experience with sklearn/scipy/numpy etc., but I find it surprising that such important and popular Python libraries offer no way to control their resource usage. In any case, it is understandable that you don't control everything, but then these parameters are misleading and need to be documented much more explicitly. I've spent an absurd amount of time trying to work through and around this behaviour to achieve high throughput, and I am still unsure how to do it reliably.

I know the chunk size isn't super-accurately adhered to, and there is the overhead of all the stuff being computed for that chunk, but it is an extremely long way from 10G to 350G. Would it make any difference if the chunk size were 1G or 500MB? Does SC2 even work with chunks? I know it doesn't use n_jobs from the job_kwargs, so maybe its memory parameters are also separate.

With regards to SC2 being experimental, while I am aware it is a work in progress, I was told off here a few months ago for using SC1. So this is giving mixed signals about which tool is the recommended one to use... Should I switch back to SC1?

I will give wobble a try.

I don't have a function-by-function breakdown of what behaves and what doesn't, as I am aiming to process a large number of recordings and am not working interactively. I'd love to break it down more precisely for you if it meant you were going to address it, but you already said that you can't.

Preprocessing usually has high resource peaks at the beginning of a recording (maybe while loading the data?) and then comes down significantly. I think CPU comes down all the way; RAM stays higher than I'd have expected, but that's probably because I was expecting the parameter to control the whole processing footprint, not only the input data size.

@yger
Collaborator

yger commented Apr 22, 2025

Let me jump into the discussion. It is true that SC2 is still evolving (we are starting the paper, so I swear it will settle for good soon), but the reason I was encouraging the switch is that SC1 is no longer maintained and is likely to be outdated and hard to install with recent versions of libraries.

Regarding memory usage, a lot of progress has been made in SC2 recently for high-channel-count MEAs (such as 4096 channels), in order to make the sorter work with reduced memory usage. This should now be the case with the main branch of spikeinterface. The memory_usage options I've introduced are internal to SC2 and were indeed made to tackle some of these issues (I can give more detail, but it might get a bit technical). So, to answer your questions:

  • SC2 works with chunks, and the results should not depend (that much) on chunk size. They might differ slightly, because clustering depends on a subset of the detected peaks, and there will be differences if chunks have different sizes.
  • I would still encourage you to use SC2.
  • What is your probe layout, i.e. how many channels are we talking about? Such large memory usage is not normal, and is likely to be the result of a bug somewhere. Do you only see such a memory peak while finding spikes? How many templates have been found?

Please upgrade to the latest main branch of spikeinterface, and let's solve this properly.

@fruce-ki
Author

Hi Pierre!

My probes are MaxTwo wells, with up to 1020 active electrodes in each recording.

@yger
Collaborator

yger commented Apr 22, 2025

Yes, such recordings were problematic with former versions of SC2 because, when used with lots of cores, the estimate_templates() function called internally during clustering was preallocating a massive amount of RAM. This has been fixed, and you should give it a try: either by reducing the number of cores used ONLY during this estimation (this is what memory_usage was for in #3721), or, even better, by skipping this estimation from raw data and instead inferring templates from the SVD values (already in RAM). The latter should now be the default in main, with the option templates_from_svd.
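If it helps, a hedged sketch of how one might check and pass these options (the exact nesting of SC2 parameters is still evolving, so `get_default_sorter_params` is the safest reference; the `templates_from_svd=True` call below assumes it is exposed as a top-level sorter parameter):

```python
import spikeinterface.full as si

# Print the current SC2 defaults to see where templates_from_svd / memory_usage actually
# live in the installed version; names and nesting may differ between releases.
print(si.get_default_sorter_params("spykingcircus2"))

# Hypothetical call, assuming templates_from_svd is a top-level parameter in main:
sorting = si.run_sorter(
    sorter_name="spykingcircus2",
    recording=recording,       # placeholder for the loaded recording
    templates_from_svd=True,
)
```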

@fruce-ki
Author

Sounds promising!
I'll give it a try as soon as the workstation is free from the current non-MEA workload I'm running.

@zm711
Member

zm711 commented Apr 22, 2025

I also just want to make clear that this isn't an easy problem to solve; see here about kernel scheduling in Linux. We can tell Python to request something, but the kernel of your OS makes the ultimate decision. For Windows that often means bouncing a process between CPUs even when that doesn't make sense. And we could tell Python to request xx GB of RAM, but the OS kernel may still grant some degree of extra memory. What we can try to fix are memory leaks in our own code, and we try to make the best possible requests, but as OSs change we often see even the biggest libraries (like scipy or sklearn) having to deal with spiking CPU and RAM usage. I think at best we could add a note in our documentation to make it clearer that we try to apply limits, but that the OS/Python often overrides the limitations we put in place.
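One generic way to protect the rest of the workstation, independent of spikeinterface, is to have the OS enforce a hard cap itself; a Unix-only sketch using Python's resource module (the 64 GB figure is purely illustrative):

```python
# Generic, Unix-only safeguard (not a spikeinterface feature): ask the kernel to cap the
# address space of this Python process. If the cap is exceeded, allocations fail with
# MemoryError instead of exhausting the whole machine.
import resource

cap_bytes = 64 * 1024**3  # illustrative 64 GB cap
resource.setrlimit(resource.RLIMIT_AS, (cap_bytes, cap_bytes))
```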

@fruce-ki
Author

fruce-ki commented Apr 23, 2025

I switched to wobble, and updated my script to the new parameters for the current main branch.

The memory footprint is looking much better!

The CPU spikes still exist during the following steps:

  • noise_level (no parallelization): went to 50%
  • write_memory_recording (no parallelization): went to 100%
  • after split_clusters with local_feature_clustering: went to 100%
  • at the beginning of find spikes (wobble) (no parallelization), and at each increment of its progress indicator: went to 100%

All these steps were fairly short-lived, so maybe they wouldn't cause too many problems at runtime.

No major CPU peaks in basic postprocessing. In preprocessing, after a very short-lived initial peak, presumably during data loading, usage stabilized to where it should be. No memory peaks during pre- or postprocessing.

@chrishalcrow
Member

Great to hear it's working better @fruce-ki !

Just to clarify, I'm not against SC2 in any way; actually, I use it and think the results are great! I more meant that because it's evolving, we users might need to spend a bit more time poking and adjusting it for our setup compared to something more mature and stable.

Also, to give an idea of how complex memory/multiprocessing is in Python, have a look at the sklearn page about parallelism: https://scikit-learn.org/stable/computing/parallelism.html. It's very complex!
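One practical takeaway from that page: the OpenMP/BLAS thread pools behind numpy, scipy, and sklearn are controlled by environment variables that must be set before those libraries are imported (this is generic Python/BLAS behaviour, not a spikeinterface setting):

```python
import os

# Set before importing numpy/scipy/sklearn, otherwise the pools are already initialised.
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP (scipy, sklearn, some numpy builds)
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS backend of numpy
os.environ["MKL_NUM_THREADS"] = "1"        # MKL backend, if installed

import numpy as np  # imported after the variables are set
```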
