
DOC: create_array(..., data=,...) #2809


Open
DerWeh opened this issue Feb 9, 2025 · 7 comments
Labels
documentation Improvements to the documentation help wanted Issue could use help from someone with familiarity on the topic

Comments

@DerWeh

DerWeh commented Feb 9, 2025

Describe the issue linked to the documentation

I am very confused about the data argument in create_array. A common use case is to simply serialize an in-memory array, in which case I tend to pass it as the data=in_memory_array argument. However, I cannot find the data argument in the documentation.

In IPython, on the other hand, zarr.create_array clearly has the argument, while zarr.Group.create_array doesn't seem to expose it. I am quite confused by the discrepancy. If this is intentional, please document it.
LLMs also suggest that

zarr.create_array("store.zarr", data=in_memory_data)

is more efficient than

arr = zarr.create_array("store.zarr", shape=in_memory_data.shape, dtype=in_memory_data.dtype)
arr[...] = in_memory_data

I have no idea whether this is true or not. zarr.create_array(..., data=in_memory_data) might indeed be more efficient, as it seems to write asynchronously. But the documentation is quite lacking on what the best practice is.


This might be a bit out of scope for this issue, so please tell me if it is. But from the documentation, I don't really see how to leverage the asynchronous nature of the zarr implementation. A common pattern I encounter is that data is generated in parallel using multiprocessing (as it is CPU-bound) and persisted using zarr (probably disk-bound). Is there a preferred pattern for using zarr as an asynchronous sink for the generated data? If so, it would be great to include it in the docs.

Suggested fix for documentation

No response

@DerWeh DerWeh added documentation Improvements to the documentation help wanted Issue could use help from someone with familiarity on the topic labels Feb 9, 2025
@d-v-b
Contributor

d-v-b commented Feb 9, 2025

thanks for this issue @DerWeh. the data keyword argument for create_array was added relatively recently and it looks like I forgot to add it to the Group.create_array method. This should be a simple fix.

This might be a bit out of scope for this issue, so please tell me if it is. But from the documentation, I don't really see how to leverage the asynchronous nature of the zarr implementation. A common pattern I encounter is that data is generated in parallel using multiprocessing (as it is CPU-bound) and persisted using zarr (probably disk-bound). Is there a preferred pattern for using zarr as an asynchronous sink for the generated data? If so, it would be great to include it in the docs.

I agree that the documentation should say more about this. Basically all of the non-async functions (like Array.__setitem__) are designed to take advantage of concurrency. But this concurrency is only useful for performance if your underlying storage layer is actually async. I don't think Python's interface to the local file system is asynchronous, so there's nothing for you to leverage there. But if you were writing to cloud storage like S3, then you would gain a lot of performance from the async layer, even without accessing it directly.

@DerWeh
Author

DerWeh commented Feb 9, 2025

Thanks for the clarification. Adding the data keyword seems simple enough. You're also right that it is recent: on version 3.0.2 it is in fact documented on rtfd, while on 3.0.1 it's not available yet. Sorry for not making sure I was reading the latest documentation.

As far as I know, Python's standard library uses synchronous operations for files. There are, however, libraries like aiofiles (which I haven't tried so far). If I understand you correctly, using such a library as the storage backend, we could expect performance improvements?

@OPMTerra
Contributor

OPMTerra commented Mar 4, 2025

Hi @d-v-b 👋,

I'd like to help complete the documentation for create_array(data=...).

Proposed Changes:

  1. Add data parameter to Group.create_array docstring
  2. Document performance benefits vs assignment
  3. Include async usage example for cloud storage

Is this approach okay?

@d-v-b
Contributor

d-v-b commented Mar 4, 2025

that sounds good! note that the docstring changes might already be handled by this PR: #2819, but please weigh in on that PR if you would like to see anything changed there

@OPMTerra
Contributor

OPMTerra commented Mar 4, 2025

@d-v-b 👋,

Thanks for pointing me to #2819 ! I've reviewed the changes and see that it addresses parameter consistency for create_array.

Remaining gaps:

  • No user guide examples for data usage
  • No explanation of performance benefits vs arr[...] = data

Can I focus on updating docs/user-guide/arrays.rst with:

  1. Example using data=...
  2. Benchmark snippet comparing data= vs assignment
  3. Async storage note (e.g., S3 + data=...)

Would this be helpful?

@d-v-b
Contributor

d-v-b commented Mar 4, 2025

that would be great, thank you!

@OPMTerra
Contributor

OPMTerra commented Mar 5, 2025

Hi @d-v-b 👋,

I've opened a PR #2890 to address this issue!

Changes included:

  • Added data= example to arrays.rst
  • Clarified async behavior for both methods

Let me know if any adjustments are needed!
Thank you for your guidance 🙏
