Added Store.getsize #2426
d-v-b
left a comment
this looks good, and I like the safe, slow default over returning -1
I might be confusing myself, but I think this implementation might not be what we want... I think what users want (like us in #2400) is the size of an Array in storage, not the size of a particular key. I guess we could do something like generate all the keys for a given array and then call So maybe we do need this, since the store knows (or can figure out) what bytes are actually stored for a given array. But we also need a bit on top of it to bring it to the array level.
In case you want to go this direction, this method is designed for exactly such a use case
I'll throw an idea into the mix. We probably want two things:
Of course,

Thanks. Looking at how Icechunk would implement Would you expect the size of metadata documents to show up in total for

I've pushed an update that adds a
    Parameters
    ----------
    prefix : str
        The prefix of the directory to measure.
Can we offer implementers the following in documentation?:
This function will be called by zarr using a prefix that is the path of a group, an array, or the root. Implementations may leave the behavior undefined when that is not the case.
Sure... I was hoping we could somehow ensure that we don't call it with anything other than a group / array / root path, but users can directly use Store.getsize_prefix and they can do whatever.
LMK if you want any more specific guidance on what to do (e.g. raise a ValueError). I'm hesitant about trying to force required exceptions into an ABC / interface.
> I'm hesitant about trying to force required exceptions into an ABC / interface.
And now I'm noticing that I've done exactly that in getsize, with requiring implementations to raise FileNotFoundError if the key isn't found :)
    """
    keys = [x async for x in self.list_prefix(prefix)]
    sizes = await gather(*[self.getsize(key) for key in keys])
    return sum(sizes)
This materializes the full list of keys in memory, can we maintain the generator longer to avoid that?
Also, this has unlimited concurrency, for a potentially very large number of keys. It could easily create millions of async tasks. We should probably run in chunks limited by the value of the concurrency setting.
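A minimal sketch of the chunked approach suggested here, assuming a hypothetical in-memory `FakeStore` (not zarr's actual Store class) and a hypothetical `limit` parameter standing in for the concurrency setting. Keys are pulled from the async generator in batches, so at most `limit` keys are held and at most `limit` getsize calls run at once.

```python
import asyncio
from collections.abc import AsyncIterator


class FakeStore:
    """Hypothetical in-memory stand-in for a store; not zarr's Store class."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = data
        self.in_flight = 0       # getsize calls currently running
        self.max_in_flight = 0   # peak number of simultaneous getsize calls

    async def list_prefix(self, prefix: str) -> AsyncIterator[str]:
        for key in self._data:
            if key.startswith(prefix):
                yield key

    async def getsize(self, key: str) -> int:
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0)  # stand-in for real I/O
        self.in_flight -= 1
        return len(self._data[key])


async def getsize_prefix(store: FakeStore, prefix: str, limit: int = 4) -> int:
    """Sum object sizes under `prefix`, gathering at most `limit` keys at a time."""
    total = 0
    batch: list[str] = []
    async for key in store.list_prefix(prefix):
        batch.append(key)
        if len(batch) == limit:
            total += sum(await asyncio.gather(*(store.getsize(k) for k in batch)))
            batch = []
    if batch:  # leftover keys that did not fill a whole batch
        total += sum(await asyncio.gather(*(store.getsize(k) for k in batch)))
    return total


store = FakeStore({f"arr/c/{i}": b"x" * i for i in range(25)})
total = asyncio.run(getsize_prefix(store, "arr/"))
```

The trade-off of this design is that each batch waits for its slowest request before the next batch starts, so it is simpler than a sliding window but can leave some concurrency unused.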
See concurrent_map for an example
> This materializes the full list of keys in memory, can we maintain the generator longer to avoid that?
I don't immediately see how that's possible.
The best I'm coming up with is a fold-like function that asynchronously iterates through keys from list_prefix and (asynchronously) calls self.getsize to update the size. Sounds kinda complicated.
FWIW, it looks like concurrent_map wants an iterable of items:
```
>       return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])
E       TypeError: 'async_generator' object is not iterable
```
In 7cbc500 I've hacked in some support for AsyncIterable there. I haven't had enough coffee to figure out what the flow of
return await asyncio.gather(*[asyncio.ensure_future(run(item)) async for item in items])
is. I'm a bit worried the async for item in items is happening immediately, so we end up building that list of keys in memory anyway.
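The worry can be checked with a small standalone experiment. This is only a sketch: `keys` and `run` are hypothetical stand-ins for list_prefix and the per-item work, and the generator here never suspends. The list comprehension is an ordinary argument expression, so it is fully evaluated (one future per item) before `asyncio.gather` is even called.

```python
import asyncio

events: list[str] = []


async def keys():
    # Hypothetical key producer; a real store listing would await I/O here.
    for i in range(3):
        events.append(f"yield {i}")
        yield i


async def run(item: int) -> int:
    events.append(f"run {item}")
    return item


async def main() -> tuple[int, list[int]]:
    # Same shape as the expression in the comment, split in two to inspect it.
    futures = [asyncio.ensure_future(run(item)) async for item in keys()]
    built_before_gather = len(futures)  # the whole list exists at this point
    results = await asyncio.gather(*futures)
    return built_before_gather, results


built_before_gather, results = asyncio.run(main())
```

In this sketch every key is yielded before any `run` call starts, because the scheduled tasks only get a turn once `main` suspends. With a generator that awaits real I/O, some tasks could start earlier, but the list of futures would still be complete before `gather` receives it.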
> We should probably run in chunks limited by the value of the concurrency setting.
Fixed. We should probably replace all instances of asyncio.gather with a concurrency-limited version. I'll make a separate issue for that.
5f1d036 removed support for AsyncIterable in concurrent_map, replacing it with a TODO.
I think there's some discussion around improving our use of asyncio to handle cases like this (using queues to mediate task producers like list_prefix and consumers like getsize) that will address this.
The unbounded concurrency issue you raised is still fixed. It's just the loading of keys into memory that's not yet addressed.
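One possible shape for the queue-mediated design mentioned above, as a hedged sketch rather than what zarr does: a producer feeds keys from list_prefix into a bounded `asyncio.Queue`, and a fixed number of consumers call getsize. `FakeStore`, `getsize_prefix_queued`, and the `workers` parameter are all hypothetical names for illustration.

```python
import asyncio
from collections.abc import AsyncIterator


class FakeStore:
    """Hypothetical in-memory stand-in for a store; not zarr's Store class."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = data
        self.in_flight = 0
        self.max_in_flight = 0

    async def list_prefix(self, prefix: str) -> AsyncIterator[str]:
        for key in self._data:
            if key.startswith(prefix):
                yield key

    async def getsize(self, key: str) -> int:
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0)  # stand-in for real I/O
        self.in_flight -= 1
        return len(self._data[key])


async def getsize_prefix_queued(store: FakeStore, prefix: str, workers: int = 4) -> int:
    # Bounded queue: the producer blocks once it is full, so keys are never
    # all held in memory at the same time.
    queue: asyncio.Queue = asyncio.Queue(maxsize=workers * 2)

    async def producer() -> int:
        async for key in store.list_prefix(prefix):
            await queue.put(key)
        for _ in range(workers):
            await queue.put(None)  # one sentinel per consumer signals "done"
        return 0

    async def consumer() -> int:
        subtotal = 0  # per-consumer subtotal avoids a shared counter
        while True:
            key = await queue.get()
            if key is None:
                return subtotal
            subtotal += await store.getsize(key)

    results = await asyncio.gather(producer(), *(consumer() for _ in range(workers)))
    return sum(results)


store = FakeStore({f"arr/c/{i}": b"x" * i for i in range(25)})
total = asyncio.run(getsize_prefix_queued(store, "arr/"))
```

Each consumer keeps its own subtotal because `shared += await ...` across coroutines can lose updates: the old value is read before the await. The sketch omits error handling; if a consumer raised, the producer could be left blocked on a full queue.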
Something like this interface
would be very useful for virtualizarr, as then we can easily and efficiently learn the byte range lengths of all objects in a store, in order to ingest existing zarr as virtual zarr. EDIT: xref zarr-developers/VirtualiZarr#262 (comment)
jhamman
left a comment
LGTM. Thanks @TomAugspurger :)
d-v-b
left a comment
looks good!
ah, we are getting some test failures after bringing in the latest changes from

Should be all set now.
Closes #2420
One difference from Zarr v2: its getsize seemed to return -1 if the concrete backend didn't provide a getsize method. I think returning a "bad" integer like -1 from a function that returns integers is dangerous. I've implemented a slow but correct default that just reads the object and calls len on the bytes.
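A minimal sketch of such a slow-but-correct default, using a hypothetical in-memory `FakeStore` whose `get` returns None for a missing key (both are assumptions of this example, not zarr's actual classes or signatures). The FileNotFoundError for a missing key follows the behavior discussed in the review comments.

```python
import asyncio
from typing import Optional


class FakeStore:
    """Hypothetical in-memory stand-in for a store; not zarr's Store class."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = data

    async def get(self, key: str) -> Optional[bytes]:
        # Assumption of this sketch: a missing key yields None.
        return self._data.get(key)

    async def getsize(self, key: str) -> int:
        # Slow fallback: read the whole object, then measure it.
        value = await self.get(key)
        if value is None:
            raise FileNotFoundError(key)
        return len(value)


store = FakeStore({"a/zarr.json": b"{}", "a/c/0": b"\x00" * 16})
size = asyncio.run(store.getsize("a/c/0"))
```

A backend that can ask its storage layer for an object's length directly would presumably override this to avoid transferring the bytes.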