Skip to content

Broken fill_value encoding in consolidated_metadata when writing format v2 from zarr-python 3 #2979

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ianhi opened this issue Apr 11, 2025 · 3 comments
Labels
bug Potential issues with the zarr-python library

Comments

@ianhi
Copy link
Contributor

ianhi commented Apr 11, 2025

Zarr version

3.0.7.dev8+g018f61d9

Numcodecs version

0.15.1

Python Version

3.12

Operating System

mac

Installation

from source

Description

I think this is similar to #2741 but for consolidated metadata. Namely that bytes as fill_value for zarr format 2 aren't properly encoded/decoded in consolidated metadata.

Steps to reproduce

import zarr
import json

fill_value = b'A'
group = zarr.group(store='blah.zarr', zarr_format=2, overwrite=True)
array = group.create(name ='x', shape=(3,), dtype='|S4', fill_value=fill_value,)
array[:] = values
zarr.consolidate_metadata("blah.zarr")

# if we load it then the fill value no longer matches
consolidated = zarr.open_group("blah.zarr")
cons_fill = consolidated.metadata.consolidated_metadata.metadata['x'].fill_value

# also directly inspect the non consolidated metadata
array_fill = zarr.open_array("blah.zarr/x").metadata.fill_value
print("Loaded Values")
print(f"array:\t {array_fill}")
print(f"consolidated: \t{cons_fill}")
print('\n')


# directly inspect the actual values on disk

print("On Disk values:")
with open("blah.zarr/x/.zarray") as f:
    print(".zarray:", json.load(f)['fill_value'])

with open("blah.zarr/.zmetadata") as f:
    print("consolidated/x/.zarray:", json.load(f)['metadata']['x/.zarray']['fill_value'])

outputs:

Loaded Values
array:	 b'A'
consolidated: 	[b'65']


On Disk values:
.zarray: QQ==
consolidated/x/.zarray: [65]

Additional output

No response

@ianhi ianhi added the bug Potential issues with the zarr-python library label Apr 11, 2025
@jhamman
Copy link
Member

jhamman commented Apr 12, 2025

A datapoint here is that if you base64 decode the on disk value QQ==, you get b'A':

In [1]: import base64

In [2]: base64.b64decode("QQ==")
Out[2]: b'A'

@ianhi
Copy link
Contributor Author

ianhi commented Apr 16, 2025

I'm not planing on looking into this further but putting down here the notes from debugging. The incorrect encoding happens in the to_dict function which is defined here:

def to_dict(self) -> dict[str, JSON]:
"""
Recursively serialize this model to a dictionary.
This method inspects the fields of self and calls `x.to_dict()` for any fields that
are instances of `Metadata`. Sequences of `Metadata` are similarly recursed into, and
the output of that recursion is collected in a list.
"""
out_dict = {}
for field in fields(self):
key = field.name
value = getattr(self, key)
if isinstance(value, Metadata):
out_dict[field.name] = getattr(self, field.name).to_dict()
elif isinstance(value, str):
out_dict[key] = value
elif isinstance(value, Sequence):
out_dict[key] = tuple(v.to_dict() if isinstance(v, Metadata) else v for v in value)
else:
out_dict[key] = value

with the issue being that the bytes were falling into the Sequence branch. A diff like this seems to fix that issue:

diff --git a/src/zarr/abc/metadata.py b/src/zarr/abc/metadata.py
index a56f9866..64ed6f4d 100644
--- a/src/zarr/abc/metadata.py
+++ b/src/zarr/abc/metadata.py
@@ -28,7 +28,7 @@ class Metadata:
             value = getattr(self, key)
             if isinstance(value, Metadata):
                 out_dict[field.name] = getattr(self, field.name).to_dict()
-            elif isinstance(value, str):
+            elif isinstance(value, str | bytes):
                 out_dict[key] = value
             elif isinstance(value, Sequence):
                 out_dict[key] = tuple(v.to_dict() if isinstance(v, Metadata) else v for v in value)

but then there is an error in this block:

items[ZMETADATA_V2_JSON] = prototype.buffer.from_bytes(
json.dumps(
{"metadata": d, "zarr_consolidated_format": 1},
cls=V3JsonEncoder,
).encode()

when ther V3JsonEncoder doesn't know how to process bytes. It seems like this ought to be a V2Encoder but there is no such object. A diff like this gets most of the way, but then there are other errors in the test suite that I didn't track down.

../../../miniforge3/lib/python3.12/json/encoder.py:439: in _iterencode
diff --git a/src/zarr/core/group.py b/src/zarr/core/group.py
index da2aa5f7..26a1c0db 100644
--- a/src/zarr/core/group.py
+++ b/src/zarr/core/group.py
@@ -369,6 +369,12 @@ class GroupMetadata(Metadata):
                             },
                         }

+                    if "fill_value" in v:
+                        from zarr.core.metadata.v2 import _serialize_fill_value
+
+                        d[f"{k}/{ZARRAY_JSON}"]["fill_value"] = _serialize_fill_value(
+                            v["fill_value"], np.dtype(v["dtype"])
+                        )
                 items[ZMETADATA_V2_JSON] = prototype.buffer.from_bytes(
                     json.dumps(
                         {"metadata": d, "zarr_consolidated_format": 1},

@ianhi
Copy link
Contributor Author

ianhi commented May 17, 2025

I think this will be fixed by #2874

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library
Projects
None yet
Development

No branches or pull requests

2 participants