memory error with large expected_groups #428
👏🏾 👏🏾 I've been waiting for you ;)

For dynamic discovery of groups at compute-time, you'd need (I welcome any suggestions for making this option easier to understand, or better documented; it is rarely needed). Though now that I think about it more, this is not going to work directly: it will accumulate to one single 50GB chunk at the end.

Now for the more interesting part. I've been thinking about how to do this smarter, in particular like this, since we can determine the bounding boxes for each polygon relatively cheaply from the vector geometries (it's incredibly expensive to do that from a 30m global raster). In that doc example, the "zone" raster is only 2GB, so flox can scan it in memory and figure out how to do it. This giant-raster-zonal-stats problem is the only one I've seen that actually needs something a lot more complicated. I'm quite interested in working it out (hopefully just in time for Cloud Native Geospatial).

So, some questions:
Ah, so here are the numbers of geometries at each admin level:
Which means at the ADM2 level, the sparsity is 47217 / (263 × 83 × 854) ≈ 0.002. So another option here is to use sparse arrays for the intermediates.
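As a back-of-the-envelope sketch (pure Python; the group counts are the ones mentioned in this thread, and the COO storage model is a generic assumption, not a flox measurement), here is why a density that low makes sparse intermediates attractive:

```python
# Dense vs sparse (COO-style) storage for the intermediate result, using
# the counts from this thread: 263 * 83 * 854 possible ADM combinations,
# of which only 47217 actually occur.
n_possible = 263 * 83 * 854          # all ADM0 x ADM1 x ADM2 combinations
n_observed = 47217                   # combinations present in the data
density = n_observed / n_possible    # roughly 0.0025

dense_bytes = n_possible * 8                # one float64 per possible group
sparse_bytes = n_observed * (8 + 3 * 8)     # COO: value + three int64 coords
print(f"density={density:.4f}, dense={dense_bytes/1e6:.0f} MB, "
      f"sparse={sparse_bytes/1e6:.1f} MB")
```

With six groupers instead of three, the dense side grows multiplicatively while the sparse side stays proportional to the combinations that actually occur.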
Thanks for the quick response and for looking into this! :) Here is some more info:
Areas.band_data:

All the groupers have the same shape as the areas dataset; they all fit the internal gridding system we use. tcl_data is just an aligned form of tcl_year, by the way.
We have a legacy pipeline that does this using gdal_rasterize, parallelized with AWS Batch. I experimented with datashader (which has a dask array backend) for this a while ago. Now that I use flox, I may revisit that.
We are doing the latter to report at every admin level. FYI, I tested running without ADM_2 and that worked smoothly.
On #430 you should be able to

It's a bit slow (I suspect #298 will improve things) but memory use seems low. I didn't run it for too long. Are you able to try it out? There's got to be a smarter way to do this, starting with just the geometries.
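As a minimal illustration of the "start with just the geometries" idea: a polygon's bounding box is cheap to compute because it only touches the vertex list, never the 30m raster. The polygon names and coordinates below are invented for the sketch:

```python
# Hypothetical sketch: per-polygon bounding boxes from vector vertex lists.
# "ADM_A" / "ADM_B" and their coordinates are made-up stand-ins for real
# admin boundaries.

def bounding_box(vertices):
    """Return (xmin, ymin, xmax, ymax) for a list of (x, y) vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))

polygons = {
    "ADM_A": [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)],
    "ADM_B": [(5.0, 5.0), (6.0, 7.0), (4.0, 6.0)],
}
boxes = {name: bounding_box(v) for name, v in polygons.items()}
```

Knowing each polygon's box up front would let a scheduler send only the overlapping raster chunks to each zone's reduction, instead of scanning the whole raster per zone.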
Forgot to respond: yes, you'll need to provide
Thanks for the updates, @dcherian. I tested it out and I'm getting this error when writing the results. As an immediate solution, I can create a single gadm layer that encodes all three levels and see if that works.
This is fixed on the latest version of #430. But yes, using a single array that encodes all three levels will work better.
I just tagged v0.10.2, which should work for you. I wrote a quick doc page for this workload: https://flox.readthedocs.io/en/latest/user-stories/large-zonal-stats.html Let me know how it goes!
Thanks for the user story docs and release, @dcherian! Looks like it made it further with your latest fix, but I'm hitting this error now:
Oops, thanks for testing it out. With #437, I now test quite a few edge cases. You will need to also specify
Great news @dcherian: this works now with all but the drivers grouper included! The density of the final sparse result is 0.00154, so the strategy is super efficient for this use case. There's something weird about the drivers layer (for example, its size explodes when aligned) and it's throwing the sparse error
Hmm, this is probably a floating point mismatch in the coordinate values. Using
That's weird. I just checked, and I'm certainly running all the edge cases in the test suite with this option.
The drivers data had the same coordinates as areas but got cast to

2025-04-09T11_38_50.714Z_to_2025-04-09T11_50_52.764Z_tcl_dask_logs.csv.csv

Memory error with all groupers:
Ah, so close! That was the very last step. It should be fixed in #440. I also fixed another case where the RangeIndex was getting realized into memory, so you should see faster graph construction too, I think.
Weird, that shouldn't affect things, but it seems like you don't see it any more.
You should be able to use a lot less memory now!
This is great @dcherian! Confirming that it works with all the groupers now, and it didn't cost more than the case without the adm2 level on the older version. Thanks!
🥳 Thanks for the fun problem hehe. Are you able to tell me how long it took, how much peak memory it used, and the cost, please? I'm hoping you didn't have to run it on 64GB nodes.
Yeah, something is really weird. Why is it using so much memory at the beginning, for example?
Hi, I'm testing out flox (with dask) as a replacement for Scala-based zonal stats on global rasters (30m resolution) and getting promising performance results with cleaner and much smaller code! However, I'm running into a memory issue, and I wanted to see if it has a cleaner solution than what I'm doing now.
Here is the simple code for calculating tree cover loss area at three political boundary levels, grouped by a total of six dask array layers with expected_groups sizes of (23, 5, 7, 248, 854, 86).
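For intuition, here is an illustrative numpy sketch (not flox internals; the raster shape and data are random stand-ins) of how six categorical groupers can be folded into a single flat group label per pixel, whose range is the product of the expected_groups sizes:

```python
import numpy as np

# Illustrative only: combining six groupers into one flat label per pixel.
# The sizes match this issue; everything else is a made-up stand-in.
sizes = (23, 5, 7, 248, 854, 86)
rng = np.random.default_rng(0)
shape = (100, 100)  # tiny stand-in for a 30m global raster
groupers = [rng.integers(0, n, size=shape) for n in sizes]

# One combined label per pixel; it ranges over prod(sizes) possible groups,
# so any dense array indexed by it must hold prod(sizes) entries.
flat = np.ravel_multi_index([g.ravel() for g in groupers], sizes)
```

Any dense structure indexed by that combined label has to cover every possible combination, whether or not it occurs in the data.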
That runs into this error:
MemoryError: Unable to allocate 109 GiB for an array with shape (14662360160,) and data type int64
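That shape is exactly the cartesian product of the expected_groups sizes, materialized densely with one int64 (8 bytes) per possible group combination:

```python
# The 109 GiB in the traceback is the full product of the expected_groups
# sizes, one int64 entry per possible combination.
sizes = (23, 5, 7, 248, 854, 86)
n = 1
for s in sizes:
    n *= s

gib = n * 8 / 2**30  # 8 bytes per int64
print(n, round(gib, 1))  # 14662360160 109.2
```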
I'm getting around this by chunking the group-by layer with the highest number of unique labels (854), doing the above in a dask delayed function over the chunks, and concatenating the results.
This works, but I may not be persisting some of these layers correctly for use by the expected_groups chunks, and it runs slower than expected. Is there a more efficient and elegant way to handle this situation? It would be great, for example, if this MultiIndex could be built dynamically in the workers with the available groups. Thanks.
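For illustration, here is a toy pure-Python sketch of that workaround (split the labels of the largest grouper into chunks, reduce each chunk independently, then concatenate); the real pipeline runs this as dask delayed tasks over dask arrays, and all names and data below are stand-ins:

```python
# Toy version of the chunked workaround. labels/values stand in for the
# 854-label grouper and the per-pixel areas; nothing here is the real data.
from collections import Counter

labels = [i % 854 for i in range(10_000)]   # stand-in grouper values
values = [1] * len(labels)                  # stand-in "area" per pixel

def reduce_chunk(label_chunk):
    """Sum values whose label falls inside label_chunk."""
    wanted = set(label_chunk)
    out = Counter()
    for lab, val in zip(labels, values):
        if lab in wanted:
            out[lab] += val
    return out

chunk_size = 100
results = Counter()
for start in range(0, 854, chunk_size):
    results.update(reduce_chunk(range(start, min(start + chunk_size, 854))))
```

Each chunk only ever materializes group totals for its own slice of labels, which is what keeps the peak memory bounded, at the cost of rescanning the data once per chunk.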
Environment:
Flox: 0.10.1
Dask: 2025.2.0
Numpy: 2.2.3
Xarray: 2025.1.2