Behavior across env builds #46


Closed · e-marshall opened this issue Mar 28, 2025 · 13 comments

@e-marshall (Owner) commented Mar 28, 2025

Expanding on discussion from #44
I've tried running the book in 3 different envs:

  1. conda w/ env built from this book/tutorial_environment_book.yml
  2. conda w/ env built from this book/.binder/environment.yml (w/ some necessary additions like aiohttp, geoviews-core, xvec, pyarrow.parquet; haven't pushed these updates yet)
  3. pixi w/ env built from pixi.toml

Behavior of different envs

  • Currently, (1) is the only env that runs the nbs successfully; it has xr pinned @ 2024.10.0
  • (3) fails/consumes a ton of memory trying to run xr.resample() operations in tutorial 1, nb 5. pixi.toml has xr 2025.1.2
  • I updated (2) to have xr pinned to 2024.10.0; however, it still has problems with resample operations, so maybe the source of the issues is something else?
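
For reference, a quick way to compare the suspect package versions across these envs (a minimal sketch, nothing book-specific):

```python
# Print the versions of the packages under suspicion; run in each env to compare.
import dask
import numpy
import xarray

for pkg in (xarray, dask, numpy):
    print(f"{pkg.__name__}: {pkg.__version__}")
```
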
@e-marshall (Owner) commented Mar 28, 2025

trying to get .binder/environment.yml to run the notebooks:

  • pin dask to match (1) (downgrade from 2025.3.0 to 2025.1.0)
    • still hangs on ds.resample()
  • the working env has numpy 1.24.3, other envs have numpy >=2.2.3,<3
    • downgrading -> no change in resample()
    • same for downgrading anyio 4.9.0 to 3.7.1
    • same for downgrading cf_xarray 0.10.4 to 0.10.0
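
For context, the failing call is a resample over the time dimension, roughly like this sketch (the dimension name and frequency here are assumptions, not the notebook's exact code):

```python
# Hypothetical sketch of the kind of call that hangs; "mid_date" and "1ME"
# are assumed names/arguments, not necessarily the notebook's exact ones.
resampled = ds.resample(mid_date="1ME").mean()
resampled.compute()  # the failing envs hang or blow up memory here
```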

Using an env built from test_env.yml, notebook 5 works. The resample operation takes 1 min; zonal_stats takes 3 min and ~15 GB RAM.

  • xr=2024.10.0
  • this is running on pyarrow 14.0.1 & np 1.24.3
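
(For anyone comparing these numbers, a rough way to measure; `run_step` is a placeholder for the resample or zonal_stats computation:)

```python
import time
import psutil  # optional, for resident memory

t0 = time.perf_counter()
result = run_step()  # placeholder for the resample or zonal_stats call
elapsed = time.perf_counter() - t0
rss_gb = psutil.Process().memory_info().rss / 1e9
print(f"{elapsed:.0f} s, {rss_gb:.1f} GB resident")
```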

@dcherian (Collaborator)

I'm trying to reproduce with .binder/environment.yml. Seems to need requests, aiohttp too.

@e-marshall (Owner) commented Mar 29, 2025

Shoot, sorry, I didn't push an updated file of that, but I had to add aiohttp, requests, pyarrow.parquet, geoviews-core, and xvec.
I was playing around with pinning a bunch of versions to match the working one (geospatial_datacube_book_env.yml in the book dir) and adding numpy_groupies, but I haven't gotten it to reproduce the behavior of that env yet. The closest I've gotten is downgrading pyarrow and numpy to numpy <2
^ this runs, but it takes a few mins at ~16 GB RAM

@dcherian (Collaborator)

Opened dask/dask#11853

@dcherian (Collaborator)

I don't understand how this ever worked, but if there's an env that works, can we just move forward with that?

@e-marshall (Owner) commented Mar 29, 2025

i'm confused by this too; it seems to work for envs built from https://github.com/e-marshall/cloud-open-source-geospatial-datacube-workflows/blob/main/book/geospatial_datacube_tutorial_env.yaml and https://github.com/e-marshall/cloud-open-source-geospatial-datacube-workflows/blob/main/book/tutorial_environment_book.yaml, but those are messes of env files.
These operations are happening lazily in that env but not in the other.
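
One way to tell which behavior you're getting is to check whether the result is still dask-backed (a sketch; the variable and dimension names are placeholders):

```python
import dask.array as da

result = ds["v"].resample(mid_date="1ME").mean()
# True -> still lazy (dask-backed); False -> it was computed eagerly as numpy
print(isinstance(result.data, da.Array))
```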

[screenshots: the resample output in each of the two envs]

i've tried to create fresh envs matching the versions of the relevant pkgs from those files and haven't been able to reproduce the 'working' behavior. i'll keep troubleshooting this tomorrow.

@dcherian (Collaborator)

Very weird. Another pragmatic approach would be to subset to the variables that you need for the notebook; that may help.
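
For example, something like this sketch (variable names are placeholders):

```python
# Sketch with placeholder variable names: keep only what the notebook uses
ds = ds[["v", "vx", "vy"]]
```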

@e-marshall (Owner)

^ that sounds good, i didn't get a chance to work on this today but i'll try tomorrow

@scottyhq (Collaborator)

So if I'm following the discussion and linked dask issue correctly, it seems that:

  1. it's critical to have flox in the environment (quick check below) or else the resampling task graphs are really bad
  2. performance suffers with dask>=2024.12.0 for the resampling.
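
A quick check for (1), since xarray's use_flox option defaults to True only when flox is importable:

```python
import flox  # ImportError here means the env is missing flox
import xarray as xr

# Explicitly opt in; this is already the default when flox is installed.
xr.set_options(use_flox=True)
```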

With these changes notebook 5 is running well for me!

Apologies for missing flox originally when copying the old environments over here @e-marshall! I'll open a PR with an update.

I believe notebook 4 could still require significantly less RAM in a few ways: just removing dask completely :) (#38 (comment)), subsetting to only the variables you're working with, or reducing figure resolution. A sketch of the first two is below.
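
Something like (the path and variable names are placeholders):

```python
import xarray as xr

# Opening without chunks= gives numpy-backed (no dask) arrays;
# then keep only the variables the notebook needs.
ds = xr.open_dataset("itslive_subset.nc")  # placeholder path; no chunks= -> no dask
ds = ds[["v"]]
```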

@dcherian (Collaborator)

yeah, if it's a 3GB raster, then just load it. I thought it was much bigger, though?

[screenshot]

@scottyhq (Collaborator)

> Apologies for missing flox originally when copying the old environments over here @e-marshall! I'll open a PR with an update.

🤦 Sorry @e-marshall, I just pushed to main accidentally; I thought I was on a fork/separate branch (8cdb6af). I can revert it, but maybe it's OK? Since this is nearing completion, it might be time to protect the main branch https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/managing-a-branch-protection-rule (require pull requests).

> I thought it was much bigger, though?

Notebook 4 starts with a compressed subset of the full version: https://e-marshall.github.io/cloud-open-source-geospatial-datacube-workflows/itslive/nbs/4_exploratory_data_analysis_single.html#load-raster-data-and-visualize-with-vector-data

@e-marshall (Owner) commented Mar 31, 2025

thanks @scottyhq!! I had missed flox too when trying to recreate it! Thanks for the tips re nb 4. On Friday I played around with saving the figure to file rather than showing the figure output and clearing the variables (roughly the sketch below); i'll also make sure to subset to just a few variables / remove dask. Thank you both for your help with this!
ETA: the commit to main is not a problem! Thanks so much for those additions; i'll look into protecting main too.
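
The pattern I mean, with placeholder names:

```python
import matplotlib.pyplot as plt

# Sketch with placeholder names: write the figure to disk instead of
# displaying it, then release it so the notebook isn't holding the memory.
fig, ax = plt.subplots()
ds["v"].isel(mid_date=0).plot(ax=ax)  # placeholder plot call
fig.savefig("velocity_map.png", dpi=100)  # lower dpi also shrinks memory use
plt.close(fig)
del fig, ax  # clear the variables
```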

@e-marshall (Owner)

nb4 RAM should be fixed in #49, which removes dask and unnecessary variables. Using the 'jupyter-resource-usage' extension, it looks like this should now use < 5 GB RAM. Going to close this issue for now unless something else comes up. Thank you all for your help!
