Register WB2 data on HF datasets? #226
Hey Alex,
HuggingFace unfortunately doesn't work super well for Zarr; at least, storing the data on HF in Zarr is a pain for any large amount of data/chunks: the git system breaks, there is a 50 GB file-size limit, and the large number of files that Zarr writes causes issues. But I think just having a small wrapper to load from GCS would probably work well.
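For illustration, a minimal sketch of what that wrapper could look like, assuming `gcsfs` is installed; the store path and variable name below are placeholders, not the real ones (check the WB2 docs for the exact bucket layout):

```python
import xarray as xr

# Illustrative path only -- substitute the real store name from the
# WeatherBench 2 docs. The bucket is public, hence token="anon".
WB2_PATH = "gs://weatherbench2/datasets/era5/<store-name>.zarr"

ds = xr.open_zarr(WB2_PATH, storage_options={"token": "anon"})

# Nothing is downloaded yet: opening is lazy, and bytes are fetched
# chunk-by-chunk from GCS only when values are actually accessed.
t2m = ds["2m_temperature"].sel(time="2020-01-01T00")
print(t2m.load())
```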
Thanks Jacob, it's good to hear about your experience with distributed data there. Yeah, I wonder if we can work with HF for them to support "remote" dataset stores. I think there would be a community benefit in distributing the data there.
@raspstephan, thanks, I think you're right that ARCO-ERA5 would probably be better to host on the datasets section of HF. But, if we can figure out the data loading, I wonder if it's possible to integrate this benchmark with their platform/community for benchmarks.
Hi alxmrs, |
Hey tonyzyl, that's an interesting point that I didn't consider. To be listed on HF means integration with their tfds-based data loader. It would imply specifying an ML-focused access pattern for a training pipeline. On a related note, Tony, have you tried out xbatcher? I think it would work better for you than vanilla Python multiprocessing; a short sketch follows.
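For reference, here is roughly what an xbatcher loop looks like (the store path, variable, and window sizes are placeholders for illustration, not WB2's actual layout):

```python
import xarray as xr
import xbatcher

# Placeholder store path, as in the wrapper sketch above.
ds = xr.open_zarr(
    "gs://weatherbench2/datasets/era5/<store-name>.zarr",
    storage_options={"token": "anon"},
)

# BatchGenerator slices the dataset into fixed-size windows, yielding
# each one as an xr.Dataset; input_overlap would give sliding windows.
gen = xbatcher.BatchGenerator(
    ds[["2m_temperature"]],
    input_dims={"time": 8, "latitude": 32, "longitude": 64},
)

for batch in gen:
    x = batch["2m_temperature"].values  # numpy array, shape (8, 32, 64)
    break
```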
Hi alxmrs, thanks for the comment. I think it was the lazy loading of Zarr in a map-style dataset that caused my training script to hang (if the Zarr store is loaded into memory, it is fine). Yes, I did use xbatcher in my previous script, and under the hood I think it is still a map-style dataset: https://github.com/xarray-contrib/xbatcher/blob/a144c257fc6dbe116c882e3b7f5490e21ca45c79/xbatcher/generators.py#L382. Regarding the integration, I hope these help:
For my parallel training, data I/O seems to be the bottleneck, so I switched to an IterableDataset when I only need one snapshot. But I haven't figured out an appropriate IterableDataset format for constructing time series.
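For what it's worth, a minimal sketch of the single-snapshot IterableDataset pattern (an illustration under assumed names, not the actual training script):

```python
import numpy as np
import torch
import xarray as xr
from torch.utils.data import IterableDataset, get_worker_info

class SnapshotDataset(IterableDataset):
    def __init__(self, zarr_path: str, variable: str):
        self.zarr_path = zarr_path
        self.variable = variable

    def __iter__(self):
        # Open the store inside the worker process rather than in
        # __init__, so lazy file handles are never shared across forks --
        # one common cause of hangs with lazy Zarr + multiprocessing.
        ds = xr.open_zarr(self.zarr_path)[self.variable]
        indices = np.arange(ds.sizes["time"])
        info = get_worker_info()
        if info is not None:
            # Shard the time axis across DataLoader workers.
            indices = indices[info.id :: info.num_workers]
        for i in indices:
            yield torch.as_tensor(ds.isel(time=int(i)).values)
```

For time series, the same loop could yield `ds.isel(time=slice(i, i + window))` windows instead of single snapshots, though keeping those windows aligned with the Zarr chunking is the tricky part.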
It looks like huggingface supports Zarr in the datasets module: huggingface/datasets#4096. (TBD how well, but hey -- the issue is closed!)
I think WB2's data could have farther reach (e.g., among traditional ML developers) if it were registered on HuggingFace.