Support for gldv2 and inaturalist datasets #241
Comments
Thanks for reaching out. Adding these datasets sounds interesting, and we are happy to support you. FedJAX uses SQLite files instead of tfrecord files. One way to add support for these datasets would be as follows:
Some of the utilities in #216 might be useful. Please let us know if you have any questions.
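To make the SQLite idea concrete, here is a minimal sketch that uses only Python's standard `sqlite3` module. The table layout (`examples` with `client_id`, `image`, and `label` columns) and the helper names `write_clients`, `read_client`, and `client_ids` are hypothetical, chosen for illustration. They are not FedJAX's actual on-disk format, which is defined by its own SQLite utilities. Images are stored as encoded bytes, so differently sized images are not a problem at storage time.

```python
import sqlite3

def write_clients(conn, client_examples):
    """Store each client's (encoded_image_bytes, label) pairs in one table.

    The schema is an assumption made for this sketch, not FedJAX's format.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS examples ("
        "client_id TEXT NOT NULL, image BLOB NOT NULL, label INTEGER NOT NULL)"
    )
    rows = [
        (client_id, image, label)
        for client_id, examples in client_examples.items()
        for image, label in examples
    ]
    conn.executemany("INSERT INTO examples VALUES (?, ?, ?)", rows)
    # An index on client_id keeps per-client reads cheap.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_client ON examples (client_id)")
    conn.commit()

def read_client(conn, client_id):
    """Return the list of (image_bytes, label) pairs for one client."""
    cursor = conn.execute(
        "SELECT image, label FROM examples WHERE client_id = ? ORDER BY rowid",
        (client_id,),
    )
    return [(bytes(image), label) for image, label in cursor]

def client_ids(conn):
    """Return all client ids in sorted order."""
    cursor = conn.execute(
        "SELECT DISTINCT client_id FROM examples ORDER BY client_id"
    )
    return [row[0] for row in cursor]

# Placeholder bytes stand in for encoded (e.g. JPEG) image data.
conn = sqlite3.connect(":memory:")
write_clients(conn, {
    "client_0": [(b"img-a", 3), (b"img-b", 7)],
    "client_1": [(b"img-c", 1)],
})
print(client_ids(conn))
print(read_client(conn, "client_0"))
```

Keeping all clients in one indexed table avoids one file per client, at the cost of decoding images when a client is read.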
Hi @stheertha and thanks for the support! I went a different way for the moment (just to play around) by creating a I don't think that's the way to go, mainly because images have different shapes and I can't create the numpy objects this way. I'm still looking into it, but do you think that would be an issue with the
Hi @marcociccone! I am not very familiar with these two datasets. By "images have different shapes", do you mean that images in these datasets are not already transformed into a uniform height/width? How about images belonging to the same client? Can they differ in height/width too? In JAX in general, we need to keep the possible input shapes to a small set to avoid repeated XLA compilations (each unique input shape configuration requires one XLA compilation), so padding or some other type of transformation is needed to ensure uniform input shapes. The main problem in deciding what to do with images of different shapes is first deciding how models consume them, so that we can choose an appropriate storage format (i.e. either padding or resizing). I am not very knowledgeable about image models. How do they deal with a batch of images that have different shapes? Based on my limited understanding of a Conv layer, won't different input shapes lead to different output shapes after a Conv layer? What will an output layer do in that case?
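As an illustration of the padding option mentioned above, here is a minimal NumPy sketch that brings images of arbitrary height and width to one fixed shape before batching, so that a jitted function downstream would only ever see a single input shape. The helper names `pad_to_shape` and `batch_images` and the 8x8 target size are assumptions made for this example, not part of FedJAX.

```python
import numpy as np

def pad_to_shape(image, height, width):
    """Zero-pad (or crop) an HxWxC image to exactly height x width."""
    image = image[:height, :width]  # crop anything that is too large
    pad_h = height - image.shape[0]
    pad_w = width - image.shape[1]
    # Pad only at the bottom and right; channels are left untouched.
    return np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))

def batch_images(images, height, width):
    """Stack variable-size images into one (N, height, width, C) array."""
    return np.stack([pad_to_shape(im, height, width) for im in images])

# Two images with different shapes, as found within a single client.
images = [
    np.ones((5, 7, 3), np.float32),
    np.ones((12, 3, 3), np.float32),
]
batch = batch_images(images, height=8, width=8)
print(batch.shape)  # (2, 8, 8, 3)
```

Whether padding, cropping, or resizing is appropriate depends on how the model consumes the images; this sketch only shows that the shape can be made uniform.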
Hi @kho! Looking at the images in the tfrecord of a randomly sampled client, I see that they have different heights and widths. This is a standard data augmentation practice when dealing with image datasets to increase the variability of the dataset. Do you think that doing something like that would be possible with the current fedjax data pipeline?
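For readers unfamiliar with this kind of augmentation, the sketch below shows one common variant: take a random square crop and resize it to a fixed size, so every example ends up with the same shape. The thread does not say which augmentation is meant, so this choice, the function name `random_crop_and_resize`, the `min_scale` parameter, and the 32x32 output size are all assumptions. Nearest-neighbor resizing is used only to keep the example dependency-free.

```python
import numpy as np

def random_crop_and_resize(image, out_size, rng, min_scale=0.5):
    """Crop a random square region and resize it to out_size x out_size."""
    height, width = image.shape[:2]
    short_side = min(height, width)
    # Side length of the crop, between min_scale * short_side and short_side.
    low = max(1, int(min_scale * short_side))
    crop = int(rng.integers(low, short_side + 1))
    top = int(rng.integers(0, height - crop + 1))
    left = int(rng.integers(0, width - crop + 1))
    patch = image[top:top + crop, left:left + crop]
    # Nearest-neighbor resize: pick a source index for each output pixel.
    idx = np.arange(out_size) * crop // out_size
    return patch[idx][:, idx]

rng = np.random.default_rng(0)
images = [
    np.zeros((40, 60, 3), np.uint8),
    np.zeros((33, 21, 3), np.uint8),
]
crops = np.stack([random_crop_and_resize(im, 32, rng) for im in images])
print(crops.shape)  # (2, 32, 32, 3)
```

Because the output shape is fixed, the augmentation itself does not introduce new input shapes for XLA to compile.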
Thanks for the clarification. This is supported by the FedJAX pipeline and can be done in two ways: Option 1: When converting tfrecords to Option 2: When converting tfrecords to
Thanks for your answer! I still need to check the codebase carefully but what if we create a
Sorry about the confusion, I didn't know the datasets were this big (should have read the READMEs more carefully). Could you help me run some quick stats on gldv2? That will help me figure out if putting everything inside a SQLite database is feasible.
Regarding your proposal of wrapping I also have one question about how people outside Google usually work with such big datasets. Are the files actually stored on local disks, some NFS volume, or some other distributed file system (e.g. GCS or S3)?
I think it would be great to port these datasets from tff to fedjax.
I would be happy to make the effort and contribute to the library, but I need a bit of support from the fedjax team 🙂
Looking at the tff codebase (gldv2, inaturalist), it looks like the load_data_from_cache function creates a tfrecords file for each client.
The only concrete classes that I see are SQLiteFederatedData and InMemoryFederatedData, but I don't think they are meant for this use case. What would be the best way to map the clients into a FederatedDataset? We could replicate something like FilePerUserClientData.
Thanks!
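To illustrate the file-per-client idea raised above, here is a small standard-library sketch of a lazy mapping from client id to one file on disk. The class `FilePerClientData` and its methods are hypothetical; they do not reproduce tff's FilePerUserClientData or any FedJAX interface, and JSON files stand in for the per-client tfrecords files purely to keep the example self-contained.

```python
import json
import os
import tempfile

class FilePerClientData:
    """Hypothetical read-only mapping from client id to one file on disk."""

    def __init__(self, directory, suffix=".json"):
        self._directory = directory
        self._suffix = suffix

    def client_ids(self):
        """List client ids by scanning file names; no file is opened."""
        names = os.listdir(self._directory)
        return sorted(
            name[:-len(self._suffix)]
            for name in names
            if name.endswith(self._suffix)
        )

    def load(self, client_id):
        """Read a single client's examples only when they are requested."""
        path = os.path.join(self._directory, client_id + self._suffix)
        with open(path) as f:
            return json.load(f)

    def clients(self):
        """Iterate over (client_id, examples) pairs, one file at a time."""
        for client_id in self.client_ids():
            yield client_id, self.load(client_id)

with tempfile.TemporaryDirectory() as tmp:
    for cid, labels in {"client_0": [1, 2], "client_1": [3]}.items():
        with open(os.path.join(tmp, cid + ".json"), "w") as f:
            json.dump({"labels": labels}, f)
    data = FilePerClientData(tmp)
    print(data.client_ids())  # ['client_0', 'client_1']
    print(data.load("client_1"))  # {'labels': [3]}
```

The trade-off of this layout is a large number of small files, which matters for the storage question (local disk, NFS, or a distributed file system) discussed in the comments.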