
Register WB2 data on HF datasets? #226

Open
alxmrs opened this issue Jan 19, 2025 · 7 comments

Comments

@alxmrs
Collaborator

alxmrs commented Jan 19, 2025

It looks like huggingface supports Zarr in the datasets module: huggingface/datasets#4096. (TBD how well, but hey -- the issue is closed!)

I think WB2's data could have broader reach (e.g. among traditional ML developers) if it were registered on huggingface.

@raspstephan
Collaborator

Hey Alex,
great to hear from you. Which datasets do you mean specifically? ERA5, absolutely, but as far as I can see you already mentioned ARCO-ERA5, which is probably the canonical ERA5 version.

@jacobbieker

HuggingFace unfortunately doesn't work super well for Zarr. At the very least, storing the data on HF in Zarr is a pain for any large amount of data/chunks: the git system breaks down, there's a 50 GB file size limit, and the large number of files that Zarr produces causes issues. But I think just having a small wrapper to load from GCS would probably work well.
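A wrapper like the one Jacob suggests could be tiny. Here's a sketch using xarray's lazy Zarr reader; the store path below is an assumed example of the public WeatherBench 2 bucket layout and should be checked against the WB2 docs:

```python
import xarray as xr

# Assumed path into the public WeatherBench 2 GCS bucket -- verify against the
# WB2 documentation before relying on it.
WB2_ERA5 = (
    "gs://weatherbench2/datasets/era5/"
    "1959-2022-6h-64x32_equiangular_conservative.zarr"
)


def open_wb2(store: str = WB2_ERA5, **kwargs) -> xr.Dataset:
    """Lazily open a WeatherBench 2 Zarr store from GCS (requires gcsfs)."""
    # token="anon" lets fsspec read the public bucket without credentials;
    # nothing is downloaded until variables are actually indexed.
    return xr.open_zarr(store, storage_options={"token": "anon"}, **kwargs)
```

A HF dataset card could then simply point at this kind of opener instead of mirroring the chunks onto HF storage.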

@alxmrs
Collaborator Author

alxmrs commented Jan 23, 2025

Thanks Jacob, it's good to hear about your experience with distributed data there. Yeah, I wonder if we can work with HF to get them to support "remote" dataset stores. I think there would be a community benefit in distributing the data there.

@alxmrs
Collaborator Author

alxmrs commented Jan 23, 2025

@raspstephan, thanks, I think you're right that ARCO-ERA5 would probably be better to host on the datasets section of HF. But, if we can figure out the data loading, I wonder if it's possible to integrate this benchmark with their platform/community for benchmarks.

@tonyzyl

tonyzyl commented Jan 25, 2025

Hi alxmrs,
After a lot of trial and error, I figured that rechunking (yes, I didn't know about ARCO at that time) into the webdataset format and making use of HF datasets' IterableDataset is the most efficient approach I have tried so far. When doing map-style lazy loading from xarray, I found that multiprocessing would cause my training script to hang.

@alxmrs
Collaborator Author

alxmrs commented Jan 25, 2025

Hey tonyzyl, that's an interesting point that I didn't consider. To be listed on HF means integration with their tfds-based data loader. It would imply specifying an ML-focused access pattern for a training pipeline.

On a related note Tony, have you tried out xbatcher? I think it would work for you better than using vanilla Python multiprocessing.

https://xbatcher.readthedocs.io/en/latest/

@tonyzyl

tonyzyl commented Jan 25, 2025

Hi alxmrs, thanks for the comment. I think it was the lazy loading of Zarr (if the Zarr store is loaded into memory, it is fine) in a map-style dataset that caused my training script to hang. Yes, I did use xbatcher in my previous script, and under the hood I think it is still a map-style dataset: https://github.com/xarray-contrib/xbatcher/blob/a144c257fc6dbe116c882e3b7f5490e21ca45c79/xbatcher/generators.py#L382.

Regarding the integration, hope these help:

For my parallel training, the data I/O seems to be the bottleneck, so I switched to an iterable dataset when I only need one snapshot. But I haven't figured out an appropriate iterable-dataset format for constructing time series.
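The single-snapshot iterable pattern Tony describes might look like the following PyTorch sketch. Everything here is hypothetical: the class and its shapes are illustrative, and the place where a real pipeline would lazily index an `xr.open_zarr(...)` dataset is only indicated in a comment.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class SnapshotStream(IterableDataset):
    """Stream single-time-step snapshots; each worker reads a disjoint slice."""

    def __init__(self, n_times: int):
        self.n_times = n_times

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: iterate over all time steps.
            idx = range(self.n_times)
        else:
            # Shard time indices across workers so snapshots aren't duplicated.
            idx = range(info.id, self.n_times, info.num_workers)
        for t in idx:
            # In practice: open the Zarr store here (once per worker) and yield
            # ds.isel(time=t) converted to a tensor, avoiding the pickling/fork
            # issues that can hang map-style lazy xarray datasets.
            yield torch.from_numpy(np.full((4, 4), float(t)))


loader = DataLoader(SnapshotStream(8), batch_size=2, num_workers=0)
print(sum(b.shape[0] for b in loader))  # 8 snapshots total
```

Because the store is opened inside `__iter__`, every worker process gets its own file handles rather than inheriting a half-initialized reader from the parent process.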
