
DOC: create_array(..., data=,...) #2809


Open
DerWeh opened this issue Feb 9, 2025 · 7 comments
Labels
documentation Improvements to the documentation help wanted Issue could use help from someone with familiarity on the topic

Comments

@DerWeh

DerWeh commented Feb 9, 2025

Describe the issue linked to the documentation

I am very confused about the data argument in create_array. A common use case is to simply serialize an in-memory array, in which case I tend to pass it as the data=in_memory_array argument. However, I cannot find the data argument in the documentation.

In IPython, on the other hand, zarr.create_array clearly has the argument, while zarr.Group.create_array doesn't seem to expose it. I am quite confused by the discrepancy. If this is intentional, please document it.
LLMs also suggest that

zarr.create_array("store.zarr", data=in_memory_data)

is more efficient than

arr = zarr.create_array("store.zarr", shape=in_memory_data.shape, dtype=in_memory_data.dtype)
arr[...] = in_memory_data

I have no idea whether this is true or not. zarr.create_array(..., data=in_memory_data) might indeed be more efficient, as it seems to write asynchronously. But the documentation is quite lacking on what the best practice is.


This might be a bit out of scope for this issue, so please tell me if it is. But from the documentation, I don't really see how to leverage the asynchronous nature of the zarr implementation. A common pattern I encounter is that data is generated in parallel using multiprocessing (as it is CPU-bound) and persisted using zarr (probably disk-bound). Is there a preferred pattern for using zarr as an asynchronous sink for the generated data? If so, it would be great to include it in the docs.

Suggested fix for documentation

No response

@DerWeh DerWeh added documentation Improvements to the documentation help wanted Issue could use help from someone with familiarity on the topic labels Feb 9, 2025
@d-v-b
Contributor

d-v-b commented Feb 9, 2025

thanks for this issue @DerWeh. the data keyword argument for create_array was added relatively recently and it looks like I forgot to add it to the Group.create_array method. This should be a simple fix.

This might be a bit out of scope for this issue, so please tell me if it is. But from the documentation, I don't really see how to leverage the asynchronous nature of the zarr implementation. A common pattern I encounter is that data is generated in parallel using multiprocessing (as it is CPU-bound) and persisted using zarr (probably disk-bound). Is there a preferred pattern for using zarr as an asynchronous sink for the generated data? If so, it would be great to include it in the docs.

I agree that the documentation should say more about this. Basically all of the non-async functions (like Array.__setitem__) are designed to take advantage of concurrency. But this concurrency is only useful for performance if your underlying storage layer is actually async. I don't think Python's interface to the local file system is asynchronous, so there's nothing for you to leverage there. But if you were writing to cloud storage like S3, then you would gain a lot of performance from the async layer, even without accessing it directly.

@DerWeh
Author

DerWeh commented Feb 9, 2025

Thanks for the clarification. Adding the data keyword seems simple enough. You're also right that it is recent: on version 3.0.2 it is in fact documented on rtfd, while on 3.0.1 it's not available yet. Sorry for not making sure I was reading the latest documentation.

As far as I know, Python's standard library uses synchronous operations for files. There are, however, libraries like aiofiles (which I haven't tried so far). If I understand you correctly, using such a library as the storage backend, we could expect performance improvements?

@OPMTerra
Contributor

OPMTerra commented Mar 4, 2025

Hi @d-v-b 👋,

I'd like to help complete the documentation for create_array(data=...).

Proposed Changes:

  1. Add data parameter to Group.create_array docstring
  2. Document performance benefits vs assignment
  3. Include async usage example for cloud storage

Is this approach okay?

@d-v-b
Contributor

d-v-b commented Mar 4, 2025

that sounds good! note that the docstring changes might already be handled by this PR: #2819, but please weigh in on that PR if you would like to see anything changed there

@OPMTerra
Contributor

OPMTerra commented Mar 4, 2025

@d-v-b 👋,

Thanks for pointing me to #2819 ! I've reviewed the changes and see that it addresses parameter consistency for create_array.

Remaining gaps:

  • No user guide examples for data usage
  • No explanation of performance benefits vs arr[...] = data

Can I focus on updating docs/user-guide/arrays.rst with:

  1. Example using data=...
  2. Benchmark snippet comparing data= vs assignment
  3. Async storage note (e.g., S3 + data=...)

Would this be helpful?

@d-v-b
Contributor

d-v-b commented Mar 4, 2025

that would be great, thank you!

@OPMTerra
Contributor

OPMTerra commented Mar 5, 2025

Hi @d-v-b 👋,

I've opened a PR #2890 to address this issue!

Changes included:

  • Added data= example to arrays.rst
  • Clarified async behavior for both methods

Let me know if any adjustments are needed!
Thank you for your guidance 🙏
