DataTree: selecting via [["a"]] suggests subset, which is missing #10014

Open
5 tasks
mathause opened this issue Jan 31, 2025 · 2 comments
Labels
bug topic-DataTree Related to the implementation of a DataTree class

Comments

@mathause
Collaborator

What happened?

Calling dt[["a"]] raises an error that suggests using .subset, which does not exist.

What did you expect to happen?

No response

Minimal Complete Verifiable Example

import xarray as xr
a = xr.Dataset(data_vars={"x": [10, 20]}, coords={"time": [0, 1]})

dt = xr.DataTree()
dt["a"] = a

dt[["a"]]  # raises NotImplementedError and suggests .subset
dt.subset  # raises AttributeError: .subset does not exist

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Anything else we need to know?

No response

Environment

master

@mathause added the bug, needs triage, and topic-DataTree labels and removed the needs triage label on Jan 31, 2025
@mathause changed the title from "selecting via tags and subset" to "DataTree: selecting via [["a"]] suggests subset, which is missing" on Jan 31, 2025
@eschalkargans

Xarray issue: DataTree: selecting via [["a"]] suggests subset, which is missing

#10014

Setup

import xarray as xr
xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.13.1 (main, Feb 10 2025, 10:59:30) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-134-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.4-development

xarray: 2025.3.0
pandas: 2.2.3
numpy: 2.2.4
scipy: 1.15.2
netCDF4: 1.7.2
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: 1.6.4.post1
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.10.1
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: 23.2.1
conda: None
pytest: 8.3.5
mypy: 1.15.0
IPython: 8.32.0
sphinx: None

With Dataset

xds = xr.Dataset({"a": xr.DataArray([0], dims="x")})
print(xds)
<xarray.Dataset> Size: 8B
Dimensions:  (x: 1)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 8B 0
print(xds["a"])
<xarray.DataArray 'a' (x: 1)> Size: 8B
array([0])
Dimensions without coordinates: x
print(xds[["a"]])
<xarray.Dataset> Size: 8B
Dimensions:  (x: 1)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 8B 0

Example of trying to select a subset of variables, with a variable not in the dataset:

print(xds[["a", "not in dataset!"]])
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

.venv/lib/python3.13/site-packages/xarray/core/dataset.py in ?(self, names)
   1168                 variables[name] = self._variables[name]
   1169             except KeyError:
-> 1170                 ref_name, var_name, var = _get_virtual_variable(
   1171                     self._variables, name, self.sizes


KeyError: 'not in dataset!'


During handling of the above exception, another exception occurred:


KeyError                                  Traceback (most recent call last)

Cell In[17], line 1
----> 1 print(xds[["a", "not in dataset!"]])


File .venv/lib/python3.13/site-packages/xarray/core/dataset.py:1321, in Dataset.__getitem__(self, key)
   1318         raise KeyError(message) from e
   1320 if utils.iterable_of_hashable(key):
-> 1321     return self._copy_listed(key)
   1322 raise ValueError(f"Unsupported key-type {type(key)}")


File .venv/lib/python3.13/site-packages/xarray/core/dataset.py:1170, in Dataset._copy_listed(self, names)
   1168     variables[name] = self._variables[name]
   1169 except KeyError:
-> 1170     ref_name, var_name, var = _get_virtual_variable(
   1171         self._variables, name, self.sizes
   1172     )
   1173     variables[var_name] = var
   1174     if ref_name in self._coord_names or ref_name in self.dims:


File .venv/lib/python3.13/site-packages/xarray/core/dataset_utils.py:79, in _get_virtual_variable(variables, key, dim_sizes)
     77 split_key = key.split(".", 1)
     78 if len(split_key) != 2:
---> 79     raise KeyError(key)
     81 ref_name, var_name = split_key
     82 ref_var = variables[ref_name]


KeyError: 'not in dataset!'

See __getitem__ in dataset.py:

return self._copy_listed(key)

When the key is an iterable of hashables (utils.iterable_of_hashable(key) is true), __getitem__ calls _copy_listed:

def _copy_listed(self, names: Iterable[Hashable]) -> Self:
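
The dispatch described above can be illustrated with a toy container. This is only a sketch and not xarray's implementation: MiniDataset and its internals are hypothetical names, and the check for "iterable of hashables" is a simplified stand-in for utils.iterable_of_hashable.

```python
from collections.abc import Hashable, Iterable


class MiniDataset:
    """Toy container (not xarray's implementation) that mimics the dispatch."""

    def __init__(self, variables: dict) -> None:
        self._variables = dict(variables)

    def __getitem__(self, key):
        # A single hashable key (e.g. a string) returns one variable.
        if isinstance(key, Hashable):
            return self._variables[key]
        # An iterable of hashables (e.g. a list of names) returns a subset.
        if isinstance(key, Iterable) and all(isinstance(k, Hashable) for k in key):
            return self._copy_listed(key)
        raise ValueError(f"Unsupported key-type {type(key)}")

    def _copy_listed(self, names) -> "MiniDataset":
        # Raises KeyError for any name that is not present.
        return MiniDataset({name: self._variables[name] for name in names})

    def __repr__(self) -> str:
        return f"MiniDataset({list(self._variables)})"


mini = MiniDataset({"a": [0], "b": [1]})
print(mini["a"])    # one variable
print(mini[["a"]])  # a new container holding only "a"
```

A list is not hashable, so mini[["a"]] takes the second branch, which mirrors why xds["a"] and xds[["a"]] return different types.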

With DataTree

xdt = xr.DataTree.from_dict({"/": xds})
print(xdt)
<xarray.DataTree>
Group: /
    Dimensions:  (x: 1)
    Dimensions without coordinates: x
    Data variables:
        a        (x) int64 8B 0
print(xdt["a"])
<xarray.DataArray 'a' (x: 1)> Size: 8B
array([0])
Dimensions without coordinates: x
print(xdt[["a"]])
---------------------------------------------------------------------------

NotImplementedError                       Traceback (most recent call last)

Cell In[20], line 1
----> 1 print(xdt[["a"]])


File .venv/lib/python3.13/site-packages/xarray/core/datatree.py:941, in DataTree.__getitem__(self, key)
    938     return self._get_item(path)
    939 elif utils.is_list_like(key):
    940     # iterable of variable names
--> 941     raise NotImplementedError(
    942         "Selecting via tags is deprecated, and selecting multiple items should be "
    943         "implemented via .subset"
    944     )
    945 else:
    946     raise ValueError(f"Invalid format for key: {key}")


NotImplementedError: Selecting via tags is deprecated, and selecting multiple items should be implemented via .subset

See __getitem__ in datatree.py:

raise NotImplementedError(

The error message comes from datatree itself and states:

NotImplementedError: Selecting via tags is deprecated, and selecting multiple items should be implemented via .subset

Trying to use .subset fails with an AttributeError:

print(xdt.subset(["a"]))
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

Cell In[21], line 1
----> 1 print(xdt.subset(["a"]))


File .venv/lib/python3.13/site-packages/xarray/core/common.py:306, in AttrAccessMixin.__getattr__(self, name)
    304         with suppress(KeyError):
    305             return source[name]
--> 306 raise AttributeError(
    307     f"{type(self).__name__!r} object has no attribute {name!r}"
    308 )


AttributeError: 'DataTree' object has no attribute 'subset'

How can we select multiple groups in a DataTree, like we can in Dataset?

@eschalkargans

eschalkargans commented Mar 31, 2025

Note: for now, the workaround I have been using is to always convert the DataTree to Dataset before selecting multiple variables:

print(xdt.to_dataset()[["a"]])
<xarray.Dataset> Size: 8B
Dimensions:  (x: 1)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 8B 0

This workaround only selects variables, not groups:

xds = xr.Dataset({"a": xr.DataArray([0], dims="x")})
xdt = xr.DataTree.from_dict({"a": xds, "b": xds, "c": xds})
print(xdt)
<xarray.DataTree>
Group: /
├── Group: /a
│       Dimensions:  (x: 1)
│       Dimensions without coordinates: x
│       Data variables:
│           a        (x) int64 8B 0
├── Group: /b
│       Dimensions:  (x: 1)
│       Dimensions without coordinates: x
│       Data variables:
│           a        (x) int64 8B 0
└── Group: /c
        Dimensions:  (x: 1)
        Dimensions without coordinates: x
        Data variables:
            a        (x) int64 8B 0
print(xdt.to_dataset())
<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*

The DataTree only contains groups, no variables, so this workaround will not work.

xdt.to_dataset()[["a", "b"]]

-> Fails with a KeyError because the resulting Dataset is indeed empty.

I would like to be able to select by paths, so that groups and variables can be selected interchangeably (e.g. below, group b and variable c/a):

print(xdt[["/b", "/c/a"]])
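
To make the requested semantics concrete, here is a minimal sketch on a toy model rather than on xarray itself. It assumes a tree can be modelled as nested dicts, where a dict is a group and any other value is a variable; select_paths is a hypothetical helper, not an existing DataTree method.

```python
import copy
from typing import Any

# Toy model (not xarray): a dict is a group, anything else is a variable.
Tree = dict[str, Any]


def select_paths(tree: Tree, paths: list[str]) -> Tree:
    """Return a pruned copy of ``tree`` that keeps only the given paths."""
    selected: Tree = {}
    for path in paths:
        parts = [part for part in path.split("/") if part]
        if not parts:
            raise KeyError(path)
        source, target = tree, selected
        # Walk down to the parent of the requested item, creating the
        # intermediate groups in the result as needed.
        for part in parts[:-1]:
            source = source[part]  # KeyError if the group is missing
            target = target.setdefault(part, {})
        # Copy the requested item: a whole group or a single variable.
        target[parts[-1]] = copy.deepcopy(source[parts[-1]])
    return selected


tree = {
    "a": {"a": [0]},
    "b": {"a": [0]},
    "c": {"a": [0], "d": [1]},
}
subset = select_paths(tree, ["/b", "/c/a"])
print(subset)
```

In this model "/b" keeps the whole group, while "/c/a" keeps only variable a inside group c and drops its sibling d (d is added here only for illustration).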

The closest thing to a subset is .filter, but it only works on groups, not on individual variables:

xds = xr.Dataset({"a": xr.DataArray([0], dims="x")})
xdt = xr.DataTree.from_dict({"toto/tata/a": xds, "toto/tete/a": xds, "toto/tutu/a": xds})
print(xdt)
<xarray.DataTree>
Group: /
└── Group: /toto
    ├── Group: /toto/tata
    │   └── Group: /toto/tata/a
    │           Dimensions:  (x: 1)
    │           Dimensions without coordinates: x
    │           Data variables:
    │               a        (x) int64 8B 0
    ├── Group: /toto/tete
    │   └── Group: /toto/tete/a
    │           Dimensions:  (x: 1)
    │           Dimensions without coordinates: x
    │           Data variables:
    │               a        (x) int64 8B 0
    └── Group: /toto/tutu
        └── Group: /toto/tutu/a
                Dimensions:  (x: 1)
                Dimensions without coordinates: x
                Data variables:
                    a        (x) int64 8B 0
print(xdt.filter(lambda node: node.path in ["/toto/tata", "/toto/tete/a"]))
<xarray.DataTree>
Group: /
└── Group: /toto
    ├── Group: /toto/tata
    └── Group: /toto/tete
        └── Group: /toto/tete/a
                Dimensions:  (x: 1)
                Dimensions without coordinates: x
                Data variables:
                    a        (x) int64 8B 0

Note that the resulting selected group is devoid of variables: /toto/tata is empty.

A more complex filter function can achieve group selection, but it seems a bit cumbersome and still does not select variables:

xds = xr.Dataset({"a": xr.DataArray([0], dims="x"), "b": xr.DataArray([1], dims="y")})
xdt = xr.DataTree.from_dict(
    {
        "/toto/tata/bar": xds,
        "/toto/tete/bar": xds,
        "/toto/tutu/bar": xds,
        "/toto/tata/foo": xds,
        "/toto/tete/foo": xds,
        "/toto/tutu/foo": xds,
    }
)
print(xdt)
<xarray.DataTree>
Group: /
└── Group: /toto
    ├── Group: /toto/tata
    │   ├── Group: /toto/tata/bar
    │   │       Dimensions:  (x: 1, y: 1)
    │   │       Dimensions without coordinates: x, y
    │   │       Data variables:
    │   │           a        (x) int64 8B 0
    │   │           b        (y) int64 8B 1
    │   └── Group: /toto/tata/foo
    │           Dimensions:  (x: 1, y: 1)
    │           Dimensions without coordinates: x, y
    │           Data variables:
    │               a        (x) int64 8B 0
    │               b        (y) int64 8B 1
    ├── Group: /toto/tete
    │   ├── Group: /toto/tete/bar
    │   │       Dimensions:  (x: 1, y: 1)
    │   │       Dimensions without coordinates: x, y
    │   │       Data variables:
    │   │           a        (x) int64 8B 0
    │   │           b        (y) int64 8B 1
    │   └── Group: /toto/tete/foo
    │           Dimensions:  (x: 1, y: 1)
    │           Dimensions without coordinates: x, y
    │           Data variables:
    │               a        (x) int64 8B 0
    │               b        (y) int64 8B 1
    └── Group: /toto/tutu
        ├── Group: /toto/tutu/bar
        │       Dimensions:  (x: 1, y: 1)
        │       Dimensions without coordinates: x, y
        │       Data variables:
        │           a        (x) int64 8B 0
        │           b        (y) int64 8B 1
        └── Group: /toto/tutu/foo
                Dimensions:  (x: 1, y: 1)
                Dimensions without coordinates: x, y
                Data variables:
                    a        (x) int64 8B 0
                    b        (y) int64 8B 1
def select(xdt: xr.DataTree, paths: list[str]) -> xr.DataTree:
    return xdt.filter(lambda node: any(node.path.startswith(p) for p in paths))


sel = select(
    xdt,
    [
        "/toto/tata",  # Whole tata group selected with bar and foo
        "/toto/tete/bar",  # Only bar subgroup is selected
        "/toto/tutu/foo/a",  # Nothing selected!
    ],
)
print(sel)
<xarray.DataTree>
Group: /
└── Group: /toto
    ├── Group: /toto/tata
    │   ├── Group: /toto/tata/bar
    │   │       Dimensions:  (x: 1, y: 1)
    │   │       Dimensions without coordinates: x, y
    │   │       Data variables:
    │   │           a        (x) int64 8B 0
    │   │           b        (y) int64 8B 1
    │   └── Group: /toto/tata/foo
    │           Dimensions:  (x: 1, y: 1)
    │           Dimensions without coordinates: x, y
    │           Data variables:
    │               a        (x) int64 8B 0
    │               b        (y) int64 8B 1
    └── Group: /toto/tete
        └── Group: /toto/tete/bar
                Dimensions:  (x: 1, y: 1)
                Dimensions without coordinates: x, y
                Data variables:
                    a        (x) int64 8B 0
                    b        (y) int64 8B 1
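
One caveat with the startswith-based predicate is that it compares raw strings, so "/toto/tata" would also match a sibling group whose name merely begins with "tata". A possible refinement, sketched below, compares whole path segments instead. make_path_predicate is a hypothetical helper, "/toto/tatami" is an invented path used only for illustration, and the stand-in nodes expose just the path attribute that the predicate reads.

```python
from pathlib import PurePosixPath
from types import SimpleNamespace
from typing import Callable, Iterable


def make_path_predicate(paths: Iterable[str]) -> Callable[[object], bool]:
    """Build a predicate that matches a node at or below any of ``paths``."""
    roots = [PurePosixPath(p) for p in paths]

    def predicate(node) -> bool:
        node_path = PurePosixPath(node.path)
        # Compare whole path segments, so "/toto/tata" does not match
        # a sibling group such as "/toto/tatami".
        return any(
            node_path == root or root in node_path.parents for root in roots
        )

    return predicate


keep = make_path_predicate(["/toto/tata", "/toto/tete/bar"])

# Stand-ins for tree nodes: only the ``path`` attribute is used.
for path in ["/toto/tata/foo", "/toto/tete/bar", "/toto/tete/foo", "/toto/tatami"]:
    print(path, keep(SimpleNamespace(path=path)))
```

Since .filter is called with a function of a node in the examples above, the result of make_path_predicate could presumably be passed to it in the same way as the lambda. Like the original select, this still matches groups only, so a variable path such as "/toto/tutu/foo/a" selects nothing.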
