Writing edits #38

Merged: 21 commits, Mar 28, 2025
Conversation

e-marshall (Owner)

Updates to background and conclusion sections

e-marshall mentioned this pull request on Mar 25, 2025
book/_config.yml (outdated)

    @@ -30,7 +30,7 @@ bibtex_bibfiles:
     repository:
       url: https://github.com/e-marshall/cloud-open-source-geospatial-datacube-workflows
       #url: https://e-marshall.github.io/cloud-open-source-geospatial-datacube-workflows/introduction.html # Online location of your book
    -  branch: main # Which branch of the repository should be used when creating links (optional)
    +  branch: writing_edits # Which branch of the repository should be used when creating links (optional)

scottyhq (Collaborator)

A couple of nit-picky suggestions to deal with Jupyter Book warnings coming from pixi run build:

1. These are easy; just need to be sure to go from # to ##, etc.:

   /tmp/cloud-open-source-geospatial-datacube-workflows/book/background/context_motivation.md:5: WARNING: Non-consecutive header level increase; H1 to H3 [myst.header]

2. WARNING: skipping unknown output mime type: application/vnd.holoviews_load.v0+json [mystnb.unknown_mime_type]

   You can add a bit to this _config.yml to suppress these:

       sphinx:
         config:
           # application/vnd.holoviews_load.v0+json, application/vnd.holoviews_exec.v0+json
           suppress_warnings: ["mystnb.unknown_mime_type"]

3. /tmp/cloud-open-source-geospatial-datacube-workflows/book/intro/open_source_setting.md:17: WARNING: 'image': Unknown option keys: ['right-align'] (allowed: ['align', 'alt', 'class', 'height', 'loading', 'name', 'scale', 'target', 'width']) [myst.directive_option]

   Not sure about this one!
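
For that last warning, a hedged guess (the image path below is hypothetical): the MyST image directive has no right-align option, but align is in the allowed list, so the fix may just be:

    ```{image} ../logos/example_logo.png
    :align: right
    :width: 200px
    ```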

e-marshall (Owner, Author)

Thanks @scottyhq! I didn't get these when running jupyter-book build; I'll switch over to using pixi run build and troubleshoot.

scottyhq (Collaborator)

In theory it should be the same (https://github.com/e-marshall/cloud-open-source-geospatial-datacube-workflows/blob/dc5defdbc070d1f7410260d60172ce1fb8d793c9/pixi.toml#L48); perhaps there is a difference in versions.

FYI, I'm running the notebooks on https://hub.cryointhecloud.com. The one additional step required to get the notebooks to find the right environment was to run (the name can be whatever you want):

    # enter the pixi environment, then register it as a Jupyter kernel
    pixi shell
    python -m ipykernel install --user --name=datacube

e-marshall (Owner, Author)

Oh OK, thanks, that's really helpful. Do you think it would be worth adding a note about this / running on JupyterHubs to the software page? I'm happy to do that later this evening if so.

scottyhq (Collaborator)

Yes, I think it would be great to have as an example of running code next to the data in AWS us-west-2. Of course not everyone has access to CryoCloud, and who knows how long each JupyterHub lasts, but there are at least a few that one could get access to:
https://docs.openveda.cloud/user-guide/scientific-computing/
https://opensarlab-docs.asf.alaska.edu

scottyhq (Collaborator), Mar 26, 2025

> python -m ipykernel install --user --name=datacube

I realized this doesn't quite work because it doesn't "activate" the environment (e.g. environment variables like PROJ_DATA don't get set in the same way they do with conda activate or pixi shell)! I'll write up a quick guide in a separate PR: #41
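
For reference, a hedged sketch of one possible workaround (not necessarily the approach in #41, and all paths here are hypothetical): Jupyter kernel specs accept an env mapping, so the needed variables can be baked into the installed kernel's kernel.json (typically under ~/.local/share/jupyter/kernels/datacube/):

    {
      "argv": ["/srv/repo/.pixi/envs/default/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
      "display_name": "datacube",
      "language": "python",
      "env": {
        "PROJ_DATA": "/srv/repo/.pixi/envs/default/share/proj"
      }
    }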

scottyhq (Collaborator)

This is great! You could also consider linking to it on the front page of the tutorial website (I think it's nice to have a quick visual for visitors to gauge what the website covers).

e-marshall (Owner, Author)

I like the idea of adding an image to the front page!

e-marshall (Owner, Author)

Is this section necessary / beyond scope? It felt like some justification for open source was needed, but maybe not. I'd be OK either way between leaving this in and polishing it a bit, or taking it out.

scottyhq (Collaborator)

To keep it concise, I think you could remove the first paragraph. For refs, this is one that I personally found inspirational when it came out: https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2015EA000136

e-marshall (Owner, Author)

That sounds great to me; thanks for the pointer to the paper! I remember reading it a while back and liking it a lot. I think it got lost in the rewrite, so I'll definitely add it back in and use it in this section.

scottyhq (Collaborator)

I'm trying this with timeseries_type = "subset". Seems like the data directory subfolders need to be created in the create_vrt_object() function, e.g. pathlib.Path('../data/raster_data/subset_timeseries/txt_files').mkdir(parents=True, exist_ok=True).
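
A minimal sketch of that fix, assuming create_vrt_object() writes its text files under a per-timeseries directory (everything here other than the mkdir call itself is a hypothetical reconstruction, not the tutorial's actual function):

    import pathlib

    def create_vrt_object(timeseries_type: str = "subset") -> pathlib.Path:
        # Hypothetical layout: ../data/raster_data/<type>_timeseries/txt_files
        txt_dir = pathlib.Path(f"../data/raster_data/{timeseries_type}_timeseries/txt_files")
        # Create any missing parent folders; no-op if they already exist
        txt_dir.mkdir(parents=True, exist_ok=True)
        return txt_dir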

scottyhq (Collaborator)

Should 'subset' be the default instead of 'full'?

e-marshall (Owner, Author)

Shoot, I'll take a look at the subdirs issue.
For the default type, I had it set to 'full' because the markdown text in the notebooks is from the perspective of working with the full time series, but do you think we should change it to 'subset' since that's the most accessible version? It isn't an issue in nb 1, but it would require some changes to nb 3.

e-marshall (Owner, Author)

This should be addressed in b44b91f.

scottyhq (Collaborator)

> For the default type, i had it set to 'full' because the markdown text in the notebooks is from the perspective of working with the full time series

OK, makes sense. Perhaps just add a note that download speeds from Zenodo can be highly variable (xref https://discourse.pangeo.io/t/downloading-speed-from-zenodo/3529). For me on CryoCloud, running the cell to download the full zip is taking 1 hr+!

scottyhq (Collaborator)

Some quick datapoints:

Home wifi in Germany (probably actually faster than hubs, b/c Zenodo/CERN is based in Switzerland: https://about.zenodo.org/infrastructure):

    asf_rtcs_full.zip   0% [          ]  310.27M  6.33MB/s  eta 2h 8m

CryoCloud (1st download finished in 1 hr; testing again a bit later is super slow):

    asf_rtcs_full.zip   0% [          ]   13.03M  677KB/s   eta 22h 9m

e-marshall (Owner, Author)

Thanks, I will add a warning. It takes ~1 hr for me to download the big time series.
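
A rough sketch of what that warning could look like as a MyST admonition in the notebook (the wording is only a suggestion):

    ```{warning}
    Download speeds from Zenodo can be highly variable. Downloading the full
    time series (asf_rtcs_full.zip) can take an hour or more.
    ```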

scottyhq (Collaborator), Mar 27, 2025

Just a note that I needed at least 17 GB of RAM to successfully run this notebook.

[screenshot: memory usage]

I haven't dug into why, but for some reason opening the zarr via dask is doubling the memory (3.3 -> 6.6 GB) in the initial loading step... but since it's only ~3 GB in memory and you immediately call compute(), an alternative is to just remove dask entirely from this notebook (chunks=None on load):

# Read raster eagerly as numpy-backed arrays (chunks=None disables dask)
single_glacier_raster = xr.open_zarr("../data/raster_data/single_glacier_itslive.zarr", 
                                     decode_coords="all",
                                     chunks=None,
                                     )

e-marshall (Owner, Author)

Working on reducing memory usage here. I think it's a few of the plots causing the issue. Considering saving the plots to file and inserting them as figures so that the objects can be deleted from memory... will add in a separate PR.
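
A rough sketch of that pattern, assuming the single_glacier_raster dataset loaded earlier in the notebook and a hypothetical variable name "v" (both are assumptions, not the tutorial's actual names):

    import matplotlib.pyplot as plt

    # Render the plot once, write it to disk, then free the figure
    fig, ax = plt.subplots(figsize=(8, 4))
    single_glacier_raster["v"].mean(dim="mid_date").plot(ax=ax)
    fig.savefig("../figures/mean_velocity.png", dpi=150, bbox_inches="tight")
    plt.close(fig)  # release the figure's memory

    # In a markdown cell, embed the saved file instead of keeping a live plot object:
    # ![Mean velocity](../figures/mean_velocity.png)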

scottyhq (Collaborator)

I really like this section :)

scottyhq (Collaborator), Mar 27, 2025

I haven't been able to execute this notebook with a 30 GB RAM CryoCloud instance. Seems to be crashing here:

vector_data_cube = dc_resamp.xvec.zonal_stats(
    rgi_subset.geometry,
    x_coords="x",
    y_coords="y",
).drop_vars("index")

e-marshall (Owner, Author), Mar 28, 2025

Hm, I'm not sure what's going on. On my computer it is ~1.5 GB. Will do some digging...

e-marshall (Owner, Author), Mar 28, 2025

OK, I think I know what may have happened. Creating the vector cube takes a ton of memory, so I ran it once and then wrote it to disk, hoping to just use the version that's read from disk (it's ~4 MB) rather than the one made in the notebook, but I accidentally added it to the .gitignore file. I think I need to figure out a way for it to be committed to the repo even though the rest of the data directory is ignored? Will work on this, but I think I'll do it in its own PR, so I'll merge this and then address that.
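
One hedged way to do that (the paths below are hypothetical): git can re-include a specific path beneath an ignored pattern with a ! negation in .gitignore, as long as the path's parent directory itself isn't ignored; alternatively, the file can be force-added once:

    # .gitignore: ignore the data files but keep the small vector cube
    data/vector_data/*
    !data/vector_data/vector_data_cube.zarr

    # or, force-add just the one path past the ignore rules:
    git add -f data/vector_data/vector_data_cube.zarr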

e-marshall (Owner, Author)

In my comment above I was thinking of a different operation; the one failing for you happens before reading the vector data cube from disk, my bad. I'm not sure why the discrepancy, but it runs for me on less than 2 GB RAM. I'll try to figure it out tomorrow.

e-marshall (Owner, Author)

I believe the difference in behavior noted here is related to an env created w/ conda vs. an env created w/ pixi. When I run this in a conda env, the whole notebook takes up less than 2 GB RAM. In a pixi env, the resample step in the first section of the notebook, dc_resamp = dc.resample(mid_date="3ME").mean(), uses > 20 GB. There is an xarray version difference between the two (the conda env defaults to xr 2024.10 and pixi to 2025.1.2), also mentioned here.

Does it make sense that this large a change in behavior could be due to the xr version?
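
One quick way to test that hypothesis (a sketch; the version string is just the one mentioned above): pin xarray in pixi.toml to match the conda env, re-solve, and re-run the resample step:

    [dependencies]
    xarray = "2024.10.*"

then run pixi install to update the environment before retrying the notebook.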

e-marshall merged commit 3d889e7 into main on Mar 28, 2025.