
Register WB2 data on HF datasets? #226

Open
alxmrs opened this issue Jan 19, 2025 · 7 comments

Comments

@alxmrs
Collaborator

alxmrs commented Jan 19, 2025

It looks like huggingface supports Zarr in the datasets module: huggingface/datasets#4096. (TBD how well, but hey -- the issue is closed!)

I think WB2's data could have broader reach (e.g. among traditional ML developers) if it were registered on huggingface.

@raspstephan
Collaborator

Hey Alex,
great to hear from you. Which datasets do you mean specifically? ERA5, absolutely, but as far as I can see you already mentioned ARCO-ERA5, which is probably the canonical ERA5 version.

@jacobbieker

HuggingFace unfortunately doesn't work super well for Zarr. At the very least, storing the data on HF in Zarr is a pain for any large amount of data/chunks: the git system breaks down, there's a 50 GB file size limit, and the large number of files that Zarr produces causes issues. But I think just having a small wrapper to load from GCS would probably work well.
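A wrapper like the one Jacob suggests could be tiny. Here's a sketch using xarray's lazy Zarr reader; the store path below is an assumed example of the public WeatherBench 2 bucket layout and should be checked against the WB2 docs:

```python
import xarray as xr

# Assumed path into the public WeatherBench 2 GCS bucket -- verify against the
# WB2 documentation before relying on it.
WB2_ERA5 = (
    "gs://weatherbench2/datasets/era5/"
    "1959-2022-6h-64x32_equiangular_conservative.zarr"
)


def open_wb2(store: str = WB2_ERA5, **kwargs) -> xr.Dataset:
    """Lazily open a WeatherBench 2 Zarr store from GCS (requires gcsfs)."""
    # token="anon" lets fsspec read the public bucket without credentials;
    # nothing is downloaded until variables are actually indexed.
    return xr.open_zarr(store, storage_options={"token": "anon"}, **kwargs)
```

A HF dataset card could then simply point at this kind of opener instead of mirroring the chunks onto HF storage.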

@alxmrs
Collaborator Author

alxmrs commented Jan 23, 2025

Thanks Jacob, it's good to hear about your experience with distributed data there. Yeah, I wonder if we can work with HF to get them to support "remote" dataset stores. I think there would be a community benefit in distributing the data there.

@alxmrs
Collaborator Author

alxmrs commented Jan 23, 2025

@raspstephan, thanks, I think you're right that ARCO-ERA5 would probably be better to host on the datasets section of HF. But, if we can figure out the data loading, I wonder if it's possible to integrate this benchmark with their platform/community for benchmarks.

@tonyzyl

tonyzyl commented Jan 25, 2025

Hi alxmrs,
After a lot of trial and error, I figured that rechunking (yes, I didn't know about ARCO at that time) into the webdataset format and making use of HF datasets' IterableDataset is the most efficient approach I have tried so far. When doing map-style lazy loading from xarray, I found that multiprocessing would cause my training script to hang.

@alxmrs
Collaborator Author

alxmrs commented Jan 25, 2025

Hey tonyzyl, that's an interesting point that I didn't consider. To be listed on HF means integration with their tfds-based data loader. It would imply specifying an ML-focused access pattern for a training pipeline.

On a related note Tony, have you tried out xbatcher? I think it would work for you better than using vanilla Python multiprocessing.

https://xbatcher.readthedocs.io/en/latest/

@tonyzyl

tonyzyl commented Jan 25, 2025

Hi alxmrs, thanks for the comment. I think it was the lazy loading of Zarr (if the Zarr store is loaded into memory, it is fine) in a map-style dataset that caused my training script to hang. Yes, I did use xbatcher in my previous script, and under the hood I think it is still a map-style dataset: https://github.com/xarray-contrib/xbatcher/blob/a144c257fc6dbe116c882e3b7f5490e21ca45c79/xbatcher/generators.py#L382.

Regarding the integration, hope these help:

For my parallel training, the data I/O seems to be the bottleneck, so I switched to an iterable dataset when I only need one snapshot. But I haven't figured out an appropriate iterable-dataset format for constructing time series.
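The single-snapshot iterable pattern Tony describes might look like the following PyTorch sketch. Everything here is hypothetical: the class and its shapes are illustrative, and the place where a real pipeline would lazily index an `xr.open_zarr(...)` dataset is only indicated in a comment.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class SnapshotStream(IterableDataset):
    """Stream single-time-step snapshots; each worker reads a disjoint slice."""

    def __init__(self, n_times: int):
        self.n_times = n_times

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: iterate over all time steps.
            idx = range(self.n_times)
        else:
            # Shard time indices across workers so snapshots aren't duplicated.
            idx = range(info.id, self.n_times, info.num_workers)
        for t in idx:
            # In practice: open the Zarr store here (once per worker) and yield
            # ds.isel(time=t) converted to a tensor, avoiding the pickling/fork
            # issues that can hang map-style lazy xarray datasets.
            yield torch.from_numpy(np.full((4, 4), float(t)))


loader = DataLoader(SnapshotStream(8), batch_size=2, num_workers=0)
print(sum(b.shape[0] for b in loader))  # 8 snapshots total
```

Because the store is opened inside `__iter__`, every worker process gets its own file handles rather than inheriting a half-initialized reader from the parent process.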
