-
-
Notifications
You must be signed in to change notification settings - Fork 354
add numcodec protocol #3318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
add numcodec protocol #3318
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3318 +/- ##
==========================================
+ Coverage 94.54% 94.55% +0.01%
==========================================
Files 78 79 +1
Lines 9423 9448 +25
==========================================
+ Hits 8909 8934 +25
Misses 514 514
🚀 New features to boost your workflow:
|
The way our code coverage works, class and function definitions are not counted as tested if they are imported by pytest prior to running actual tests. Pytest imports from a lot of our library due to fixtures defined in This means adding new functions always reduces coverage, because the function definition is uncovered. |
e5ffc33
to
ef31c5b
Compare
@@ -1282,7 +1282,7 @@ def test_gpu_basic(store: Store, zarr_format: ZarrFormat | None) -> None: | |||
dtype=src.dtype, | |||
overwrite=True, | |||
zarr_format=zarr_format, | |||
compressors=compressors, | |||
compressors=compressors, # type: ignore[arg-type] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of type: ignore
comments currently dotted throughout - what's the reason for that? Is it perhaps because numcodecs.abc.Codec doesn't currently conform to the new protocol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this particular example is a test, and if you look a few lines above you will see that compressors
is either None
or the string "auto", and mypy models this as str | None
, leading to this error:
tests/test_api.py:1285: error: Argument "compressors" to "create_array" has incompatible type "str | None"; expected "CompressorsLike" [arg-type]
I suspect the other type: ignore
statements were added for different reasons
Spinning this question out into the main thread: without type annotations on |
To make sure I understand, does passing an existing numcodecs codec to |
Shouldn't it annotate our expectations, but allow for the (one?) exception to mark itself as non compliant? We probably don't want any other codecs making Object, and this is a way to say that. In fact, it raises the other question, for a different discussion: is having the pickle protocol reasonable? |
Yes, which is why in this PR I am annotating the
It might be reasonable for some applications (it was evidently reasonable enough to get implemented in the first place). It's also incredibly dangerous, because it performs arbitrary code execution. |
this is a great question, and the answer depends on which type checker you ask. if we take this script: # /// script
# requires-python = ">=3.11"
# dependencies = [
# "zarr"
# ]
# ///
import zarr
from numcodecs import GZip
import numcodecs
from zarr.abc.numcodec import Numcodec
a: Numcodec = GZip()
b: numcodecs.abc.Codec = GZip() mypy reports no problems:
but the more thorough based-pyright type checker complains about both
so, tldr is that the status quo is broken because numcodecs doesn't support type checking, AND the new protocol is using a too-narrow type. i'll look into fixing this, but it's important to keep in mind that the status quo is broken. edit: changing the numcodecs annotation from |
…n into feat/numcodecs-protocol
as of 82992c5 I'm using |
…at/numcodecs-protocol
Something discussed briefly on the dev call today: I'd love if we could avoid the checks ( As @d-v-b mentioned, we'd ideally have some validation at the outer layer. Once we're past that, we know we have alid metadata, so we can get from there to (what should be) a valid python codec class. At some point we should be able to avoid these checks. But we do this already on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some very minor suggestions around providing docstrings since the typing is now so broad.
Mostly, I'm confused about the protocol not aligning with the Zarr V3 specification. https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#id22 states:
Logically, a codec c must define three properties:
c.compute_encoded_representation_type(decoded_representation_type), a procedure that determines the encoded representation based on the decoded representation and any codec parameters. In the case of a decoded representation that is a multi-dimensional array, the shape and data type of the encoded representation must be computable based only on the shape and data type, but not the actual element values, of the decoded representation. If the decoded_representation_type is not supported, this algorithm must fail with an error.
c.encode(decoded_value), a procedure that computes the encoded representation, and is used when writing an array.
c.decode(encoded_value, decoded_representation_type), a procedure that computes the decoded representation, and is used when reading an array
It doesn't seem like the V2 spec provided any definitions for a codec. Will external codec developers follow the numcodec protocol or the Zarr V3 specification?
Co-authored-by: Max Jones <[email protected]>
Co-authored-by: Max Jones <[email protected]>
Co-authored-by: Max Jones <[email protected]>
Co-authored-by: Max Jones <[email protected]>
Co-authored-by: Max Jones <[email protected]>
This is deliberate. This protocol is a model of a thing that already exists (the numcodecs codec ABC). The numcodecs codec ABC is not a v3 codec, so neither is this model. There is no intention of using the
Although the v2 spec does define the JSON form of a codec, it does not define the codec API. But the protocol I'm adding in this PR is narrowly scoped to modeling a particular API in numcodecs, not anything in the v2 spec (even though that numcodecs API does implement the v2 spec). |
Reminder that we are adding this protocol for compatibility with existing codecs that were written for zarr python 2.x. Developers working on new codecs that target zarr python 3.x should almost certainly use the v3-style codec API defined in |
from typing_extensions import Protocol | ||
|
||
|
||
class Numcodec(Protocol): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this work with https://docs.python.org/3/library/typing.html#typing.runtime_checkable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried :( but I learned that @runtime_checkable
doesn't work for protocols with non-methods (i.e., attributes), and the numcodecs.abc.Codec
ABC uses codec_id
as an attribute. I'm not 100% sure we need the codec_id
here, but if we ever wanted to register these codecs, then it would be important.
This PR adds a protocol to model
numcodecs.abc.Codec
. The motivation for this protocol is to ensure that we can process external codecs that adhere to the numcodecs API without needing numcodecs as a dependency.TODO:
docs/user-guide/*.rst
changes/