Support extension array indexes #9671

Open
wants to merge 228 commits into
base: main

Conversation

ilan-gold
Contributor

Identical to kmuehlbauer#1 - probably not very helpful in terms of changes since https://github.com/kmuehlbauer/xarray/tree/any-time-resolution-2 contains most of it....

kmuehlbauer and others added 30 commits October 18, 2024 07:31
…ore/variable.py to use any-precision datetime/timedelta with automatic inferring of resolution
…t resolution, fix code and tests to allow this
… more carefully, for now using pd.Series to convert `OMm` type datetimes/timedeltas (will result in ns precision)
…rray` series creating an extension array when `.array` is accessed
@@ -1118,7 +1118,8 @@ def test_groupby_math_nD_group() -> None:
     expected = da.isel(x=slice(30)) - expanded_mean
     expected["labels"] = expected.labels.broadcast_like(expected.labels2d)
     expected["num"] = expected.num.broadcast_like(expected.num2d)
-    expected["num2d_bins"] = (("x", "y"), mean.num2d_bins.data[idxr])
+    # mean.num2d_bins.data is a pandas IntervalArray so needs to be put in `numpy` to allow indexing
+    expected["num2d_bins"] = (("x", "y"), mean.num2d_bins.data.to_numpy()[idxr])
Contributor

This is technically backwards-incompatible, but an improvement IMO. Just noting in case someone looks this up in the future.

Before:

num2d_bins
mean.num2d_bins
<xarray.DataArray 'num2d_bins' (num2d_bins: 2)> Size: 16B
array([Interval(0, 4, closed='right'), Interval(4, 6, closed='right')],
      dtype=object)
Coordinates:
  * num2d_bins  (num2d_bins) object 16B (0, 4] (4, 6]

After:

ipdb> mean.num2d_bins
mean.num2d_bins
<xarray.DataArray 'num2d_bins' (num2d_bins: 2)> Size: 16B
array([Interval(0, 4, closed='right'), Interval(4, 6, closed='right')],
      dtype=object)
Coordinates:
  * num2d_bins  (num2d_bins) interval[int64, right] 16B (0, 4] (4, 6]
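The dtype difference in the coordinate line above can be reproduced with pandas and NumPy alone. This is a minimal sketch, not xarray's own code path: the variable names `bins` and `as_object` are illustrative, and the breaks `[0, 4, 6]` are chosen only to match the intervals shown in the repr.

```python
import numpy as np
import pandas as pd

# Two right-closed integer intervals, (0, 4] and (4, 6], matching the repr above.
bins = pd.arrays.IntervalArray.from_breaks(np.array([0, 4, 6], dtype="int64"))

# The pandas extension array carries a dedicated interval dtype.
print(bins.dtype)

# Coercing to NumPy produces a generic object array of pd.Interval scalars,
# which is what the "Before" coordinate dtype reflects.
as_object = np.asarray(bins)
print(as_object.dtype)
```

Keeping the extension dtype means downstream code can tell that the coordinate holds intervals without inspecting individual elements.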

@@ -834,6 +834,7 @@ def chunk(
if chunkmanager.is_chunked_array(data_old):
data_chunked = chunkmanager.rechunk(data_old, chunks) # type: ignore[arg-type]
else:
ndata: duckarray[Any, Any]
Contributor

I removed the pandas-specific code. I'm not sure we should do that; we might as well just ask the user to cast.
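As a rough illustration of what "asking the user to cast" could look like, the sketch below converts a pandas extension array to a plain NumPy array before any further processing. It uses only pandas and NumPy; the `Categorical` input and the names `cat` and `plain` are assumptions made for the example, and nothing here is taken from the `chunk` implementation.

```python
import numpy as np
import pandas as pd

# A pandas extension array (a Categorical is used purely as an example).
cat = pd.Categorical(["a", "b", "a"])

# Explicit cast by the user: the result is an ordinary NumPy array holding
# the category values, with no extension dtype attached.
plain = np.asarray(cat)
print(type(plain).__name__, plain.tolist())
```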

* main:
  Vendor pandas to xarray conversion tests (pydata#10187)
  Fix: Correct axis labelling with units for FacetGrid plots (pydata#10185)
  Use explicit repo name in upstream wheels (pydata#10181)
  DOC: Update docstring to reflect renamed section (pydata#10180)
@@ -104,17 +104,11 @@ def index_flat(request):
index fixture, but excluding MultiIndex cases.
"""
key = request.param
if key in ["bool-object", "bool-dtype", "nullable_bool", "repeats"]:
Contributor

There seems to be some weird broadcasting behaviour here.

@dcherian
Contributor

Sorry, this is a total mess. Apparently IndexVariable and Variable now behave differently, and I'm not sure why.

@@ -945,7 +944,7 @@ def load(self, **kwargs):
         --------
         dask.array.compute
         """
-        self._data = to_duck_array(self._data, **kwargs)
+        self._data = _maybe_wrap_data(to_duck_array(self._data, **kwargs))
Contributor

Maybe we should just return the PandasExtensionArray wrapper class, but I'm wary of exposing that to users.

Contributor

OK, I did this. It seems much neater, at the expense of exposing the PandasExtensionArray wrapper class.

* main:
  Bump scientific-python/upload-nightly-action in the actions group (pydata#10192)
  Add new whats-new section (pydata#10190)
  release 2025.03.1 (pydata#10188)
  Support zarr `write_empty_chunks` for zarr-python 3 and up (pydata#10177)
@dcherian changed the title from "(fix): extension array indexers" to "Support extension array indexes" on Apr 1, 2025
@ilan-gold
Contributor Author

@dcherian Could you give a bit of background into the changes you pushed? I'm not really following.

Sorry, this is a total mess. Apparently IndexVariable and Variable now behave differently, and I'm not sure why.

Did I do something wrong in the PR without knowing it, i.e., by bypassing the tests? It would be great to understand!

@dcherian
Contributor

dcherian commented Apr 1, 2025

No, you didn't do anything wrong per se.

  1. I wanted the pandas-specific logic to live inside indexing.py as much as possible (and definitely not in namedarray/core.py), so moving that exposed some other warts. The solution right now is to expose the PandasExtensionArray wrapper class.
  2. The groupby_bins tests needed to be updated because previously the IntervalArray got cast to a numpy object array of tuples.
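The kind of test update described in point 2 can be sketched with pandas and NumPy alone. pandas extension arrays are one-dimensional, so the test converts with `.to_numpy()` before applying an N-D indexer. In the sketch below, `idxr` is a hypothetical 2-D integer indexer standing in for the one used in the test, and `bins` and `picked` are illustrative names.

```python
import numpy as np
import pandas as pd

# Interval data kept as a pandas extension array instead of a NumPy object array.
bins = pd.arrays.IntervalArray.from_breaks(np.array([0, 4, 6], dtype="int64"))

# Hypothetical 2-D integer indexer (an assumption; the real test builds its own).
idxr = np.array([[0, 1], [1, 0]])

# Convert to a NumPy object array first; NumPy fancy indexing with a 2-D
# indexer then returns a 2-D array of pd.Interval scalars.
picked = bins.to_numpy()[idxr]
print(picked.shape)  # (2, 2)
```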

8 participants