zarr.core.group.Group does not allow to access nested groups using file path-like syntax (?) #2765

Open
aladinor opened this issue Jan 25, 2025 · 7 comments
Labels
bug Potential issues with the zarr-python library

Comments

@aladinor

aladinor commented Jan 25, 2025

Zarr version

v3.0

Numcodecs version

v0.15

Python Version

3.11

Operating System

Linux

Installation

conda

Description

Hi everyone,

While working on Xarray issue #9960, I discovered that zarr-python v3 does not allow access to group members using file path-like syntax (e.g., "/a").

Steps to reproduce

This is an MCVE:

import xarray
import xarray.testing as xt
import numpy as np
import zarr


if __name__ == "__main__":
    ds_a = xarray.Dataset({
        "A": (("x", "y"), np.ones((128, 256))),
    })
    ds_b = xarray.Dataset({
        "B": (("y", "x"), np.ones((256, 128)) * 2)
    })
    ds_c = xarray.Dataset({
        "G": (("x", "y"), np.zeros((128, 256)))
    })
    ds_rt = xarray.Dataset({
        "z": (("x", "y"), np.zeros((128, 256))),
        "w": (("x"), np.random.rand(128))
    })

    dt = xarray.DataTree.from_dict(
        {
            "/": ds_rt,
            "/a": ds_a,
            "/b": ds_b,
            "/c/d": ds_c
        }
    )
    path = "testv3_dt.zarr"
    dt.to_zarr(path, compute=True, mode="w")

The group paths in this datatree are ['/', '/c', '/c/d', '/b', '/a']. However, after reopening the Zarr store, get() returns None for any of these paths:

kwargs = {'mode': 'r', 'path': '/', 'storage_options': None, 'synchronizer': None, 'zarr_format': None}
store = zarr.open_consolidated(path, **kwargs)
print(store.get("/a"))
None

Digging a little further, I found that store.members() in zarr-python v3 returns the relative path of each member, and these relative paths can be used to get the groups within the Zarr store:

print([path for path, _ in store.members()])
['a', 'b', 'w', 'c', 'z']

Now we can access the nested groups using these results:

print(store.get("a"))
<Group file://testv3_dt.zarr/a>

Should zarr-python v3 groups support file path-like syntax to access groups?

Another thing I noticed is that datasets stored at the root level (ds_rt, which contains the z and w DataArrays) are not represented as a group (root group "/") but are instead exposed directly as Zarr arrays.

print(store.get("z"))
<Array file://testv3_dt.zarr/z shape=(128, 256) dtype=float64>

How could we access the root group (store.get("/")) instead of accessing the arrays directly (store.get("z"))?

Additional output

Zarr-python v2 used to return a <class 'zarr.hierarchy.Group'>, which allowed us to access nested groups using file path-like syntax.

## this part of the code was executed using zarr-python v2
kwargs = {'mode': 'r', 'path': '/', 'storage_options': None, 'synchronizer': None}
store = zarr.open_consolidated(path, **kwargs)

print(store.get("/"))
<zarr.hierarchy.Group '/' read-only>

print(store.get("/a"))
<zarr.hierarchy.Group '/a' read-only>
@aladinor aladinor added the bug Potential issues with the zarr-python library label Jan 25, 2025
@jhamman
Member

jhamman commented Jan 25, 2025

Thanks @aladinor for looking into this. I opened pydata/xarray#9984 to track the datatree+zarr3 integration.

I thought we had fixed the leading slash issue here in zarr, so I'm wondering if that is still the cause or if there is something else going on.

@d-v-b
Contributor

d-v-b commented Jan 25, 2025

Thanks for this report. I think there are a few things to unpack here:

Should zarr-python v3 groups support file path-like syntax to access groups?

Definitely. As @jhamman notes, some of the weirdness you observed might come from zarr-python v3 using relative paths for the names of arrays and groups (e.g., "a") instead of absolute paths (e.g., "/a"). We should accept both as valid inputs for group.get, and I can look into why this isn't working today.
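Until both forms are accepted, a caller can normalize the path before the lookup. The following is a minimal sketch, not zarr-python's own behavior: to_relative and get_member are hypothetical helper names, and the only assumption about the group object is that it exposes get(path, default), as the Group used in this thread does. Whether a nested relative path such as "c/d" resolves through get is also an assumption here.

```python
def to_relative(path: str) -> str:
    """Turn an absolute-style path such as "/c/d" into the relative form "c/d"."""
    return path.strip("/")


def get_member(group, path: str, default=None):
    """Look up `path` under `group`, accepting "/a" as well as "a".

    An empty path (for example "/") is treated as a reference to `group`
    itself, so the root never has to be fetched through get().
    """
    relative = to_relative(path)
    if not relative:
        # "/" or "" names the group we already hold.
        return group
    # Hypothetical usage: `group` is the object returned by open_consolidated.
    return group.get(relative, default)
```

With the store from the example above, get_member(store, "/a") and get_member(store, "a") would both go through store.get("a"), while get_member(store, "/") would return store itself.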

How could we access the root group (store.get("/")) instead of accessing the arrays directly (store.get("z"))?

Can you explain why this is needed? In your example, store is the root group. You already have a reference to it. Why do you need x.get('/') to return the same group as x?

@aladinor
Author

aladinor commented Jan 25, 2025

Thanks, @jhamman and @d-v-b for your prompt reply.

Can you explain why this is needed? In your example, store is the root group. You already have a reference to it. Why do you need x.get('/') to return the same group as x?

We have a datatree that looks like this:

<xarray.DataTree>
Group: /
│   Dimensions:  (x: 128, y: 256)
│   Dimensions without coordinates: x, y
│   Data variables:
│       z        (x, y) float64 262kB 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
│       w        (x) float64 1kB 0.01734 0.8962 0.6293 ... 0.2805 0.2753 0.2004
├── Group: /a
│       Dimensions:  (x: 128, y: 256)
│       Dimensions without coordinates: x, y
│       Data variables:
│           A        (x, y) float64 262kB 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
├── Group: /b
│       Dimensions:  (y: 256, x: 128)
│       Dimensions without coordinates: y, x
│       Data variables:
│           B        (y, x) float64 262kB 2.0 2.0 2.0 2.0 2.0 ... 2.0 2.0 2.0 2.0 2.0
└── Group: /c
    └── Group: /c/d
            Dimensions:  (x: 128, y: 256)
            Dimensions without coordinates: x, y
            Data variables:
                G        (x, y) float64 262kB 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0

I guess it is because that was the way we handled it with zarr v2: we had a hierarchical structure and then queried all groups, including the root group. However, I might need to recheck how to get the root-level dataset (ds_rt).

Group: /
│   Dimensions:  (x: 128, y: 256)
│   Dimensions without coordinates: x, y
│   Data variables:
│       z        (x, y) float64 262kB 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
│       w        (x) float64 1kB 0.01734 0.8962 0.6293 ... 0.2805 0.2753 0.2004

I am still not sure how to get it from the store. Maybe that is because the store is itself the root group?

@d-v-b
Contributor

d-v-b commented Jan 25, 2025

open_consolidated returns a group. In your first example, the variable you bound to "store" is in fact a zarr group (the root group), so you already have a reference to it.

@aladinor
Author

Thanks @d-v-b for your explanation. I think I found a more technical explanation for this behaviour. When opening a datatree with xarray, we need to iterate over all nested groups within the hierarchical structure, including the root group, as shown here:

https://github.com/pydata/xarray/blob/1c7ee65d560fa3067dc4424c672393839fa972d3/xarray/backends/zarr.py#L661C8-L675C11

Then, each nested group can be accessed from the "store" group using the .get() method.

https://github.com/pydata/xarray/blob/1c7ee65d560fa3067dc4424c672393839fa972d3/xarray/backends/zarr.py#L664

Thus, my question is, how can we get the root group using the .get() method?

@d-v-b
Contributor

d-v-b commented Jan 26, 2025

I don't think we necessarily want Group.get() to return a copy of itself, so I would recommend special-casing the root group in your iteration, e.g.:

root = {'/': zarr_group}
members = dict(zarr_group.members())
tree = root | members
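One possible extension of this idea, as a sketch only: if the caller wants absolute-style keys such as "/a" and "/c/d", matching the group paths listed in the original report, the relative member names can be prefixed while the root is still special-cased. build_tree is a hypothetical helper name. It takes any iterable of (name, node) pairs, so it makes no assumption about how the members were collected.

```python
def build_tree(root_group, members):
    """Map absolute-style paths to nodes, with the root special-cased under "/".

    `members` is any iterable of (relative_name, node) pairs, for example
    the pairs produced by a group's members() method.
    """
    tree = {"/": root_group}
    for name, node in members:
        # Member names are relative ("a", "c/d"); add a single leading slash.
        tree["/" + name.lstrip("/")] = node
    return tree
```

A call might look like build_tree(zarr_group, zarr_group.members(max_depth=None)); this assumes that max_depth=None makes members() walk the whole hierarchy rather than only the immediate children, which should be checked against the installed zarr-python version. The result also contains arrays such as z and w, so code that wants only groups would need to filter the values.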

@aladinor
Author

@d-v-b thanks for your suggestions. We need to do some refactoring, but it makes sense to me. We need to wait for absolute-path support before implementing this.


3 participants