Broken `fill_value` encoding in `consolidated_metadata` when writing format v2 from zarr-python 3 #2979

ianhi · 2025-04-11T22:53:32Z

Zarr version

3.0.7.dev8+g018f61d9

Numcodecs version

0.15.1

Python Version

3.12

Operating System

mac

Installation

from source

Description

I think this is similar to #2741 but for consolidated metadata. Namely that bytes as fill_value for zarr format 2 aren't properly encoded/decoded in consolidated metadata.

Steps to reproduce

import zarr
import json

fill_value = b'A'
group = zarr.group(store='blah.zarr', zarr_format=2, overwrite=True)
array = group.create(name ='x', shape=(3,), dtype='|S4', fill_value=fill_value,)
array[:] = values
zarr.consolidate_metadata("blah.zarr")

# if we load it then the fill value no longer matches
consolidated = zarr.open_group("blah.zarr")
cons_fill = consolidated.metadata.consolidated_metadata.metadata['x'].fill_value

# also directly inspect the non consolidated metadata
array_fill = zarr.open_array("blah.zarr/x").metadata.fill_value
print("Loaded Values")
print(f"array:\t {array_fill}")
print(f"consolidated: \t{cons_fill}")
print('\n')


# directly inspect the actual values on disk

print("On Disk values:")
with open("blah.zarr/x/.zarray") as f:
    print(".zarray:", json.load(f)['fill_value'])

with open("blah.zarr/.zmetadata") as f:
    print("consolidated/x/.zarray:", json.load(f)['metadata']['x/.zarray']['fill_value'])

outputs:

Loaded Values
array:	 b'A'
consolidated: 	[b'65']


On Disk values:
.zarray: QQ==
consolidated/x/.zarray: [65]

Additional output

No response

The text was updated successfully, but these errors were encountered:

jhamman · 2025-04-12T16:21:48Z

A datapoint here is that if you base64 decode the on disk value QQ==, you get b'A':

In [1]: import base64

In [2]: base64.b64decode("QQ==")
Out[2]: b'A'

ianhi · 2025-04-16T18:02:18Z

I'm not planing on looking into this further but putting down here the notes from debugging. The incorrect encoding happens in the to_dict function which is defined here:

zarr-python/src/zarr/abc/metadata.py

Lines 18 to 37 in 5f49d24

    
               def to_dict(self) -> dict[str, JSON]: 
        
                   """ 
        
                   Recursively serialize this model to a dictionary. 
        
                   This method inspects the fields of self and calls `x.to_dict()` for any fields that 
        
                   are instances of `Metadata`. Sequences of `Metadata` are similarly recursed into, and 
        
                   the output of that recursion is collected in a list. 
        
                   """ 
        
                   out_dict = {} 
        
                   for field in fields(self): 
        
                       key = field.name 
        
                       value = getattr(self, key) 
        
                       if isinstance(value, Metadata): 
        
                           out_dict[field.name] = getattr(self, field.name).to_dict() 
        
                       elif isinstance(value, str): 
        
                           out_dict[key] = value 
        
                       elif isinstance(value, Sequence): 
        
                           out_dict[key] = tuple(v.to_dict() if isinstance(v, Metadata) else v for v in value) 
        
                       else: 
        
                           out_dict[key] = value

with the issue being that the bytes were falling into the Sequence branch. A diff like this seems to fix that issue:

diff --git a/src/zarr/abc/metadata.py b/src/zarr/abc/metadata.py
index a56f9866..64ed6f4d 100644
--- a/src/zarr/abc/metadata.py
+++ b/src/zarr/abc/metadata.py
@@ -28,7 +28,7 @@ class Metadata:
             value = getattr(self, key)
             if isinstance(value, Metadata):
                 out_dict[field.name] = getattr(self, field.name).to_dict()
-            elif isinstance(value, str):
+            elif isinstance(value, str | bytes):
                 out_dict[key] = value
             elif isinstance(value, Sequence):
                 out_dict[key] = tuple(v.to_dict() if isinstance(v, Metadata) else v for v in value)

but then there is an error in this block:

zarr-python/src/zarr/core/group.py

Lines 372 to 376 in 5f49d24

    
           items[ZMETADATA_V2_JSON] = prototype.buffer.from_bytes( 
        
               json.dumps( 
        
                   {"metadata": d, "zarr_consolidated_format": 1}, 
        
                   cls=V3JsonEncoder, 
        
               ).encode()

when ther V3JsonEncoder doesn't know how to process bytes. It seems like this ought to be a V2Encoder but there is no such object. A diff like this gets most of the way, but then there are other errors in the test suite that I didn't track down.

../../../miniforge3/lib/python3.12/json/encoder.py:439: in _iterencode
diff --git a/src/zarr/core/group.py b/src/zarr/core/group.py
index da2aa5f7..26a1c0db 100644
--- a/src/zarr/core/group.py
+++ b/src/zarr/core/group.py
@@ -369,6 +369,12 @@ class GroupMetadata(Metadata):
                             },
                         }

+                    if "fill_value" in v:
+                        from zarr.core.metadata.v2 import _serialize_fill_value
+
+                        d[f"{k}/{ZARRAY_JSON}"]["fill_value"] = _serialize_fill_value(
+                            v["fill_value"], np.dtype(v["dtype"])
+                        )
                 items[ZMETADATA_V2_JSON] = prototype.buffer.from_bytes(
                     json.dumps(
                         {"metadata": d, "zarr_consolidated_format": 1},

ianhi · 2025-05-17T20:37:42Z

I think this will be fixed by #2874

ianhi added the bug Potential issues with the zarr-python library label Apr 11, 2025

ianhi mentioned this issue Apr 11, 2025

⚠️ Nightly upstream-dev CI failed ⚠️ pydata/xarray#10189

Closed

dcherian mentioned this issue Apr 26, 2025

roundtripping bytes with specific _FillValue fails on Zarr 3.0.7 pydata/xarray#10256

Open

This was referenced May 1, 2025

Monthly issue metrics report #3030

Open

Monthly issue metrics report sanketverma1704/zarr-python#10

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Broken `fill_value` encoding in `consolidated_metadata` when writing format v2 from zarr-python 3 #2979

Broken `fill_value` encoding in `consolidated_metadata` when writing format v2 from zarr-python 3 #2979

ianhi commented Apr 11, 2025

jhamman commented Apr 12, 2025

Uh oh!

ianhi commented Apr 16, 2025

Uh oh!

ianhi commented May 17, 2025

Uh oh!

Uh oh!

Broken fill_value encoding in consolidated_metadata when writing format v2 from zarr-python 3 #2979

Broken fill_value encoding in consolidated_metadata when writing format v2 from zarr-python 3 #2979

Comments

ianhi commented Apr 11, 2025

Zarr version

Numcodecs version

Python Version

Operating System

Installation

Description

Steps to reproduce

Additional output

jhamman commented Apr 12, 2025

Uh oh!

ianhi commented Apr 16, 2025

Uh oh!

ianhi commented May 17, 2025

Uh oh!

Broken `fill_value` encoding in `consolidated_metadata` when writing format v2 from zarr-python 3 #2979

Broken `fill_value` encoding in `consolidated_metadata` when writing format v2 from zarr-python 3 #2979