add numcodec protocol #3318

Open
wants to merge 39 commits into
base: main

Contributor

@d-v-b d-v-b commented Jul 31, 2025

This PR adds a protocol to model numcodecs.abc.Codec. The motivation for this protocol is to ensure that we can process external codecs that adhere to the numcodecs API without needing numcodecs as a dependency.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jul 31, 2025

codecov bot commented Jul 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.55%. Comparing base (1264a4d) to head (fcc010b).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3318      +/-   ##
==========================================
+ Coverage   94.54%   94.55%   +0.01%     
==========================================
  Files          78       79       +1     
  Lines        9423     9448      +25     
==========================================
+ Hits         8909     8934      +25     
  Misses        514      514              
Files with missing lines Coverage Δ
src/zarr/abc/codec.py 95.06% <100.00%> (+0.77%) ⬆️
src/zarr/abc/numcodec.py 100.00% <100.00%> (ø)
src/zarr/api/asynchronous.py 87.62% <ø> (ø)
src/zarr/api/synchronous.py 92.95% <ø> (ø)
src/zarr/codecs/_v2.py 93.61% <100.00%> (-0.14%) ⬇️
src/zarr/core/_info.py 95.18% <100.00%> (ø)
src/zarr/core/array.py 97.10% <100.00%> (-0.01%) ⬇️
src/zarr/core/metadata/v2.py 91.27% <100.00%> (ø)
src/zarr/registry.py 88.81% <100.00%> (+0.23%) ⬆️

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Jul 31, 2025
@d-v-b
Contributor Author

d-v-b commented Aug 1, 2025

The way our code coverage works, class and function definitions are not counted as tested if they are imported by pytest prior to running actual tests. Pytest imports from a lot of our library due to fixtures defined in conftest.py, which is why our coverage reports in CI report function definitions as uncovered.

This means adding new functions always reduces coverage, because the function definition is reported as uncovered. Until we fix how our coverage is measured, I excluded lines that raise NotImplementedError with `# pragma: no cover` to get coverage green for this PR.
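As a rough illustration of the exclusion described above, the sketch below marks a `raise NotImplementedError` line with the coverage.py pragma comment. The class name `CodecBase` is hypothetical and is not taken from this PR.

```python
class CodecBase:
    """Hypothetical base class, used only to show where the pragma goes."""

    def encode(self, buf):
        # coverage.py skips lines carrying this comment by default,
        # so the line no longer counts against the coverage total.
        raise NotImplementedError  # pragma: no cover


try:
    CodecBase().encode(b"data")
    raised = False
except NotImplementedError:
    raised = True
```

The pragma only changes what the coverage report counts; the line still runs and raises as usual.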

@d-v-b d-v-b force-pushed the feat/numcodecs-protocol branch from e5ffc33 to ef31c5b Compare August 1, 2025 19:36
@@ -1282,7 +1282,7 @@ def test_gpu_basic(store: Store, zarr_format: ZarrFormat | None) -> None:
         dtype=src.dtype,
         overwrite=True,
         zarr_format=zarr_format,
-        compressors=compressors,
+        compressors=compressors,  # type: ignore[arg-type]
Contributor

There are a lot of type: ignore comments currently dotted throughout - what's the reason for that? Is it perhaps because numcodecs.abc.Codec doesn't currently conform to the new protocol?

Contributor Author

this particular example is a test, and if you look a few lines above you will see that compressors is either None or the string "auto", and mypy models this as str | None, leading to this error:

tests/test_api.py:1285: error: Argument "compressors" to "create_array" has incompatible type "str | None"; expected "CompressorsLike" [arg-type]

I suspect the other type: ignore statements were added for different reasons
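One possible way to avoid this kind of `type: ignore` is to give the value a declared literal type, so the checker does not widen `"auto"` to `str`. The sketch below is an assumption-laden toy: `CompressorsArg`, `create_array_stub`, and `choose` are hypothetical names and do not reproduce zarr's `CompressorsLike` or the actual test.

```python
from typing import Literal, Optional

# Hypothetical stand-in for the real parameter type; NOT zarr's CompressorsLike.
CompressorsArg = Optional[Literal["auto"]]


def create_array_stub(compressors: CompressorsArg) -> str:
    """Toy function that accepts only the literal "auto" or None."""
    return "default compressors" if compressors == "auto" else "no compressors"


def choose(use_default: bool) -> CompressorsArg:
    # The declared return type keeps the literal type instead of
    # letting the checker infer the wider `str | None`.
    return "auto" if use_default else None


with_default = create_array_stub(choose(True))
without_default = create_array_stub(choose(False))
```

Whether this is worth doing in a test is a judgment call; a targeted `type: ignore[arg-type]` is shorter.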

@d-v-b
Contributor Author

d-v-b commented Aug 5, 2025

Is it perhaps because numcodecs.abc.Codec doesn't currently conform to the new protocol?

Spinning this question out into the main thread: numcodecs.abc.Codec does not have type annotations for its methods, and even if it did, decode would need to be annotated as returning object thanks to Pickle.decode, which can return literally any python object.

without type annotations on numcodecs.abc.Codec, I don't think we can use a type checker to check if a numcodecs codec is assignable to Numcodec, but this PR does add runtime checks, which are tested.

@dstansby
Contributor

dstansby commented Aug 5, 2025

To make sure I understand, does passing an existing numcodecs codec to compressors or filters fail type checking in user code with this PR (because they don't conform to the new protocol)?

@martindurant
Member

even if it did, decode would need to be annotated as returning object thanks to Pickle.decode

Shouldn't it annotate our expectations, but allow for the (one?) exception to mark itself as non compliant? We probably don't want any other codecs making Object, and this is a way to say that. In fact, it raises the other question, for a different discussion: is having the pickle protocol reasonable?

@d-v-b
Contributor Author

d-v-b commented Aug 5, 2025

Shouldn't it annotate our expectations, but allow for the (one?) exception to mark itself as non compliant? We probably don't want any other codecs making Object, and this is a way to say that.

Yes, which is why in this PR I am annotating the encode and decode methods as returning Buffer | NDBuffer.

In fact, it raises the other question, for a different discussion: is having the pickle protocol reasonable?

It might be reasonable for some applications (it was evidently reasonable enough to get implemented in the first place). It's also incredibly dangerous, because it performs arbitrary code execution.
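To make the code-execution point concrete, here is a small, harmless sketch using only the standard library `pickle` module. The `Payload` class is hypothetical; it shows that the pickle stream, not the reader, decides which callable runs during loading.

```python
import pickle


class Payload:
    """Object whose pickle stream tells the unpickler to call a function."""

    def __reduce__(self):
        # (callable, args): pickle.loads will call eval("6 * 7").
        # A hostile stream could name any importable callable here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())
restored = pickle.loads(blob)  # eval runs while the bytes are being loaded
```

Here `restored` is the result of the `eval` call rather than a `Payload` instance, which is why decoding pickled chunks from an untrusted store is risky.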

@d-v-b
Contributor Author

d-v-b commented Aug 5, 2025

To make sure I understand, does passing an existing numcodecs codec to compressors or filters fail type checking in user code with this PR (because they don't conform to the new protocol)?

this is a great question, and the answer depends on which type checker you ask.

if we take this script:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr"
# ]
# ///

import zarr

from numcodecs import GZip
import numcodecs
from zarr.abc.numcodec import Numcodec

a: Numcodec = GZip()
b: numcodecs.abc.Codec = GZip()

mypy reports no problems:

bennettd@dvb-desktop-0 ➜  zarr-python git:(feat/numcodecs-protocol) ✗ uvx mypy test_3.py
Success: no issues found in 1 source file

but the more thorough based-pyright type checker complains about both a and b, for different reasons:

Installed 2 packages in 126ms
/home/bennettd/dev/zarr-python/test_3.py
  /home/bennettd/dev/zarr-python/test_3.py:8:8 - warning: Import "zarr" is not accessed (reportUnusedImport)
  /home/bennettd/dev/zarr-python/test_3.py:10:6 - warning: Stub file not found for "numcodecs" (reportMissingTypeStubs)
  /home/bennettd/dev/zarr-python/test_3.py:11:8 - warning: Stub file not found for "numcodecs" (reportMissingTypeStubs)
  /home/bennettd/dev/zarr-python/test_3.py:14:15 - error: Type "GZip" is not assignable to declared type "Numcodec"
    "GZip" is incompatible with protocol "Numcodec"
      "encode" is an incompatible type
        Type "(buf: Unknown) -> bytes" is not assignable to type "(buf: Buffer | NDBuffer) -> (Buffer | NDBuffer)"
          Function return type "bytes" is incompatible with type "Buffer | NDBuffer"
            Type "bytes" is not assignable to type "Buffer | NDBuffer"
      "decode" is an incompatible type
        Type "(buf: Unknown, out: Unknown | None = None) -> (Unknown | bytes)" is not assignable to type "(buf: Buffer | NDBuffer, out: Buffer | NDBuffer | None = None) -> (Buffer | NDBuffer)"
          Function return type "Unknown | bytes" is incompatible with type "Buffer | NDBuffer"
    ... (reportAssignmentType)
  /home/bennettd/dev/zarr-python/test_3.py:15:4 - warning: Type of "abc" is unknown (reportUnknownMemberType)
  /home/bennettd/dev/zarr-python/test_3.py:15:4 - warning: Type of "Codec" is unknown (reportUnknownMemberType)
  /home/bennettd/dev/zarr-python/test_3.py:15:14 - error: "abc" is not a known attribute of module "numcodecs" (reportAttributeAccessIssue)
2 errors, 5 warnings, 0 notes

So, tl;dr: the status quo is broken because numcodecs doesn't support type checking, AND the new protocol uses a too-narrow type. I'll look into fixing this, but it's important to keep in mind that the status quo is already broken.

Edit: changing the numcodecs annotation from numcodecs.abc.Codec to Codec, after importing Codec from numcodecs.abc, does pass pyright. But we never use that import in our codebase.

@d-v-b
Contributor Author

d-v-b commented Aug 5, 2025

as of 82992c5 I'm using Any for the input / output types of Numcodec. We can narrow this type as needed later on.
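For readers following along, here is a minimal sketch of what a protocol with `Any` input and output types could look like. It is not the definition from this PR: the name `NumcodecLike`, the toy `ZlibCodec`, and the exact member list are assumptions, chosen to mirror the public surface of numcodecs.abc.Codec (`codec_id`, `encode`, `decode`, `get_config`, `from_config`).

```python
import zlib
from typing import Any, Optional, Protocol


class NumcodecLike(Protocol):
    """Hypothetical structural model of a numcodecs-style codec."""

    codec_id: str

    def encode(self, buf: Any) -> Any: ...

    def decode(self, buf: Any, out: Optional[Any] = None) -> Any: ...

    def get_config(self) -> Any: ...

    @classmethod
    def from_config(cls, config: Any) -> "NumcodecLike": ...


class ZlibCodec:
    """Toy codec that satisfies the protocol without inheriting from it."""

    codec_id = "toy-zlib"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def encode(self, buf: Any) -> Any:
        return zlib.compress(bytes(buf), self.level)

    def decode(self, buf: Any, out: Optional[Any] = None) -> Any:
        # `out` is accepted for signature compatibility and ignored here.
        return zlib.decompress(bytes(buf))

    def get_config(self) -> Any:
        return {"id": self.codec_id, "level": self.level}

    @classmethod
    def from_config(cls, config: Any) -> "ZlibCodec":
        return cls(level=config.get("level", 1))


# Structural typing: no numcodecs import and no shared base class needed.
codec: NumcodecLike = ZlibCodec(level=3)
roundtrip = codec.decode(codec.encode(b"hello zarr"))
```

With `Any` everywhere, a type checker only verifies that the members exist with compatible shapes, which is the trade-off the comment above accepts for now.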

@TomAugspurger
Contributor

Something discussed briefly on the dev call today: I'd love it if we could avoid the checks (isinstance on main, the custom is_* checks in this PR).

As @d-v-b mentioned, we'd ideally have some validation at the outer layer. Once we're past that, we know we have valid metadata, so we can get from there to (what should be) a valid python codec class. At some point we should be able to avoid these checks.

But we do this already on main so fixing that here is out of scope for this PR.

Member

@maxrjones maxrjones left a comment

I made some very minor suggestions around providing docstrings since the typing is now so broad.

Mostly, I'm confused about the protocol not aligning with the Zarr V3 specification. https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#id22 states:

Logically, a codec c must define three properties:

c.compute_encoded_representation_type(decoded_representation_type), a procedure that determines the encoded representation based on the decoded representation and any codec parameters. In the case of a decoded representation that is a multi-dimensional array, the shape and data type of the encoded representation must be computable based only on the shape and data type, but not the actual element values, of the decoded representation. If the decoded_representation_type is not supported, this algorithm must fail with an error.

c.encode(decoded_value), a procedure that computes the encoded representation, and is used when writing an array.

c.decode(encoded_value, decoded_representation_type), a procedure that computes the decoded representation, and is used when reading an array

It doesn't seem like the V2 spec provided any definitions for a codec. Will external codec developers follow the numcodec protocol or the Zarr V3 specification?

@d-v-b
Contributor Author

d-v-b commented Aug 10, 2025

Mostly, I'm confused about the protocol not aligning with the Zarr V3 specification. https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#id22 states:

This is deliberate. This protocol is a model of a thing that already exists (the numcodecs codec ABC). The numcodecs codec ABC is not a v3 codec, so neither is this model. There is no intention of using the Numcodec protocol as a v3-style codec in zarr python. Instead, in a later PR I will define a NumcodecWrapper class that is a v3 codec. This class will wrap the methods defined on the Numcodec protocol in a v3-codec-compatible way.
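To illustrate the general idea of such a wrapper, here is a toy adapter that exposes the three logical properties quoted from the v3 spec on top of a numcodecs-style `encode`/`decode` pair. Everything here is hypothetical (`ToyZlib`, `WrapperSketch`, the string-valued representation type); it is not the planned NumcodecWrapper and does not use zarr-python's actual codec base classes.

```python
import zlib
from typing import Any


class ToyZlib:
    """Toy numcodecs-style codec (encode/decode only), for illustration."""

    codec_id = "toy-zlib"

    def encode(self, buf: Any) -> bytes:
        return zlib.compress(bytes(buf))

    def decode(self, buf: Any, out: Any = None) -> bytes:
        return zlib.decompress(bytes(buf))


class WrapperSketch:
    """Hypothetical adapter exposing the three logical v3 codec properties."""

    def __init__(self, codec: Any) -> None:
        self.codec = codec

    def compute_encoded_representation_type(self, decoded_type: str) -> str:
        # This toy handles bytes -> bytes only; other types must fail.
        if decoded_type != "bytes":
            raise ValueError(f"unsupported representation type: {decoded_type!r}")
        return "bytes"

    def encode(self, decoded_value: bytes) -> bytes:
        return self.codec.encode(decoded_value)

    def decode(self, encoded_value: bytes, decoded_type: str) -> bytes:
        # Validate the requested representation before delegating.
        self.compute_encoded_representation_type(decoded_type)
        return self.codec.decode(encoded_value)


wrapped = WrapperSketch(ToyZlib())
payload = wrapped.encode(b"chunk bytes")
restored = wrapped.decode(payload, "bytes")
```

The wrapped codec stays unaware of v3 concepts; only the adapter knows about representation types.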

It doesn't seem like the V2 spec provided any definitions for a codec. Will external codec developers follow the numcodec protocol or the Zarr V3 specification?

Although the v2 spec does define the JSON form of a codec, it does not define the codec API. But the protocol I'm adding in this PR is narrowly scoped to modeling a particular API in numcodecs, not anything in the v2 spec (even though that numcodecs API does implement the v2 spec).

@d-v-b
Contributor Author

d-v-b commented Aug 10, 2025

Will external codec developers follow the numcodec protocol or the Zarr V3 specification?

Reminder that we are adding this protocol for compatibility with existing codecs that were written for zarr python 2.x. Developers working on new codecs that target zarr python 3.x should almost certainly use the v3-style codec API defined in zarr.abc.codec.Codec.

from typing_extensions import Protocol


class Numcodec(Protocol):
Member

Contributor Author

I tried :( but I learned that @runtime_checkable doesn't work for protocols with non-methods (i.e., attributes), and the numcodecs.abc.Codec ABC uses codec_id as an attribute. I'm not 100% sure we need the codec_id here, but if we ever wanted to register these codecs, then it would be important.
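A small sketch of one limitation of this kind, as I understand the standard library's behaviour: with a `@runtime_checkable` protocol that declares a non-method member, `isinstance` checks on instances still run, but `issubclass` checks on classes raise `TypeError`. The names `HasCodecId`, `Toy`, and `is_codec_class` are hypothetical, and the hand-written helper only stands in for the kind of custom check discussed in this PR.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class HasCodecId(Protocol):
    codec_id: str  # non-method member

    def encode(self, buf: Any) -> Any: ...


class Toy:
    codec_id = "toy"

    def encode(self, buf: Any) -> Any:
        return buf


# Instance checks only look for the presence of the members.
instance_ok = isinstance(Toy(), HasCodecId)

# Class checks are refused for protocols with non-method members.
try:
    issubclass(Toy, HasCodecId)
    subclass_error = None
except TypeError as exc:
    subclass_error = str(exc)


def is_codec_class(obj: object) -> bool:
    """Hand-written class-level check (hypothetical helper)."""
    return (
        isinstance(obj, type)
        and isinstance(getattr(obj, "codec_id", None), str)
        and callable(getattr(obj, "encode", None))
    )
```

A helper like `is_codec_class` can also verify the attribute's type, which the built-in protocol check does not do.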
