Skip to content

Make create_array signatures consistent #2819

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Feb 12, 2025

This PR ensures that the various invocations of create_array are consistent. For reference, we have 4 ways to call create_array:

  • zarr.core.array.create_array (the actual function that does stuff)
  • zarr.api.synchronous.create_array (synchronous wrapper around the async function)
  • zarr.core.group.AsyncGroup.create_array (method on AsyncGroup class that invokes create_array)
  • zarr.core.group.Group.create_array (synchronous wrapper around the AsyncGroup method)

All of these functions should have consistent signatures, but in main they don't, and a big part of this is missing tests. So this PR adds some tests that check that certain pairs of functions have identical parameters (we can't force the return types to match, because the async functions will return coroutines). To make the tests pass, this PR also ensures that all of the invocations of create_array have parameters that are consistent.

I say "consistent" because, when invoking Group.create_array, that method does not take a store argument, because we have one already from the group instance. Also, Group.create_array was recently given an extra keyword argument (compressor) that we need to deprecate. So the test that compares zarr.core.array.create_array with zarr.core.group.AsyncGroup.create_array only checks that all of create_array parameters are present in the group method.

closes #2810

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 12, 2025
@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 12, 2025

worth noting that this changes some defaults (we were using a default fill_value of None in some places, and 0 in other places, for example). When in doubt, I picked the default that seemed more defaulty.

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Feb 12, 2025
@LDeakin
Copy link
Contributor

LDeakin commented Feb 13, 2025

worth noting that this changes some defaults (we were using a default fill_value of None in some places, and 0 in other places, for example). When in doubt, I picked the default that seemed more default.

fill_value = None seems a more sensible default to me, with some docs somewhere explaining how this maps to a default fill value for each data type. 0 needs conversion for any non-integer data type, and this is exactly what caused #2792 (comment) in zarr-python 2.

@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 13, 2025

worth noting that this changes some defaults (we were using a default fill_value of None in some places, and 0 in other places, for example). When in doubt, I picked the default that seemed more default.

fill_value = None seems a more sensible default to me, with some docs somewhere explaining how this maps to a default fill value for each data type. 0 needs conversion for any non-integer data type, and this is exactly what caused #2792 (comment) in zarr-python 2.

That works for me. In main I think we have a mix of 0 and None floating around, I think I chose 0 for this PR because it seemed like the majority, but I'm happy to switch to None here

@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 27, 2025

with 642272d the default fill value is None for the array creation funcs.

Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for this in general - I left one request for a more verbose changelog entry, and one question.

Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changelog update, looks great to me. I left one more suggestion - I did a bit of testing and this doesn't seem to change the fill value that's set on the arrays (which is good, but I think worth reassuring users)

@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 28, 2025

Thanks for the changelog update, looks great to me. I left one more suggestion - I did a bit of testing and this doesn't seem to change the fill value that's set on the arrays (which is good, but I think worth reassuring users)

how did you check this? because for zarr v2 data, it should make a difference -- None is a valid v2 fill value, and it's not the same as 0

@dstansby
Copy link
Contributor

I didn't check for v2 🙈 . If it does make a difference, that should be made very clear in the changelog (and we should think about putting this PR in a non-bugfix release since it's a breaking change?)

@LDeakin
Copy link
Contributor

LDeakin commented Feb 28, 2025

how did you check this? because for zarr v2 data, it should make a difference -- None is a valid v2 fill value, and it's not the same as 0

I didn't check for v2 🙈 . If it does make a difference, that should be made very clear in the changelog (and we should think about putting this PR in a non-bugfix release since it's a breaking change?)

I'd hope that since the below PRs, the metadata of the Zarr V2 data with fill_values=None will have a null fill value but actual stored fill values will match the V3 default rather than being completely undefined.

@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 28, 2025

@LDeakin that was my impression as well. I think for v2 data the contents of the array metadata will be sensitive to the None / 0 distinction, but at runtime the actual array data should be the same in zarr-python. but this might vary in other implementations.

@LDeakin
Copy link
Contributor

LDeakin commented Feb 28, 2025

I think for v2 data the contents of the array metadata will be sensitive to the None / 0 distinction, but at runtime the actual array data should be the same in zarr-python. but this might vary in other implementations

Indeed! To be honest, I think null fill values in Zarr V2 were a bit of a misstep anyway, mainly because undefined values in partially written chunks cannot be distinguished from real data. This could be a good opportunity for zarr-python to stop writing null fill values altogether in Zarr V2 metadata and just write the default fill value. What do you think?

@d-v-b
Copy link
Contributor Author

d-v-b commented Feb 28, 2025

The nullable fill value is also clunky because zarr v2 supports the python Object dtype, and None is a valid instance of the Object dtype, which makes it impossible to distinguish "no fill value" from "the fill value is the literal None" for arrays with that dtype (or any other dtype that admits None as a value).

We would have to get a measure of the impact on users, but if it doesn't inconvenience a lot of people then I would definitely support dropping the creation of arrays with a null fill value, at least for numeric types

@@ -521,7 +521,7 @@ async def test_consolidated_metadata_v2(self):
dtype=dtype,
attributes={"key": "a"},
chunks=(1,),
fill_value=0,
fill_value=None,
Copy link
Contributor

@dstansby dstansby Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to clarify, this PR is a breaking one for writing v2 data, which changes what is written as the fill value if the user doesn't specify it? And this changed test is the manifestation of that breaking change? Is this something we want to avoid breaking?

(sorry for all the questions, just trying to understand 😄 )

Copy link
Contributor Author

@d-v-b d-v-b Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a change to how v2 metadata is written. the actual array data will be written as before. Because the signatures of the affected functions are inconsistent, there is no way to achieve the goal of this PR without changing the default behavior of some of those functions, which will result in different metadata being written to disk as an effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so to your questions: it's breaking if someone relied on the default metadata document produced by some, but not all, of our array creation routines. And no, we do not want to avoid breaking, because the cost of a confusing, inconsistent API is worth paying in this case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to all that. In which case this is a backwards incompatible API change, and we promise to increment the major version number for backwards incompatible API changes. So, what do we want to do - bump the version number v4, or re-write the versioning policy?

I'm guessing it's the latter... which is a pain, but I think we should be honest and transparent to users if we're going to put backwards incompatible API changes in minor releases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should adjust the versioning policy, and also release the changes in this PR in a minor (non-patch) release. We were already planning on making breaking changes without incrementing the major version number -- deprecated functions like create were slated to be removed long before a 4.x release. So I think the versioning policy needs updating.

@dstansby dstansby self-requested a review March 4, 2025 19:52
Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be a bit of a pain, but I'm going to put a request changes on this until we've resolved how we're going to bump the version number (and/or update our versioning policy) for this change. See #2819 (comment) for more context.

@d-v-b
Copy link
Contributor Author

d-v-b commented Mar 4, 2025

that works for me! would you like to open an issue about the versioning policy, or should I?

@dstansby
Copy link
Contributor

dstansby commented Mar 4, 2025

I can 👍

@d-v-b d-v-b moved this from Done to In progress in zarr-python release 3.1.0 Apr 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

Group.create_array does not take a data keyword argument
3 participants