Regression in DataArrays created from Pandas #10301
Comments
@richard-berg thanks for the issue.
import numpy as np
import pandas as pd
import xarray as xr
index1 = np.array([1, 2, 3])
index2 = np.array([1, 2, 4])
srs = pd.Series(index=index1, data=1).convert_dtypes()
arr = srs.to_xarray()
arr + 5 works for me on
As for the test coverage, I agree we should increase it. I will add the cases you raise here, but since you appear to be aware of more, I would love more guidance. If you look through my PRs here, it's a bit of whack-a-mole: while I think I am using relatively sound practices as I go, xarray has a lot of edge cases in its API that I am not familiar with. If you're aware of some, it would be great to handle them.
I would be opposed to somehow going around special casing, because now we actually do let through datetimes as well as interval arrays (which are both tested even if it is not immediately obvious). The reason I had to do #9042 was exactly because all of the special casing that existed before was so complex that unraveling it required a massive PR. So with special casing, we would be looking at categoricals, interval types, and datetimes passed through and only numerics excluded, until another type comes along. So I'd somewhat rather be very clear here that everything passes through.
P.S. I see:
And we just made it so that
Whoops,
While I could see floats being a good idea, nullable integers do not exist in numpy:
import numpy as np
In [3]: np.array([np.nan, 1, 2], dtype="int32")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 np.array([np.nan, 1, 2], dtype="int32")
ValueError: cannot convert float NaN to integer
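For contrast with the numpy error above, here is a minimal sketch, assuming only numpy and pandas, of how the pandas nullable Int64 extension dtype holds a missing value and what happens when it is converted to a plain numpy array. The variable names are illustrative and do not come from the xarray code base.

```python
import numpy as np
import pandas as pd

# pandas' nullable integer dtype keeps a separate mask for missing values,
# so the remaining entries stay integers.
nullable = pd.array([None, 1, 2], dtype="Int64")
print(nullable.dtype)
print(nullable.isna())

# numpy has no integer NA, so converting to a numpy array needs a float
# dtype, with NaN standing in for the missing value.
as_float = nullable.to_numpy(dtype="float64", na_value=np.nan)
print(as_float.dtype)
print(as_float)
```

The conversion is lossy in the sense that the integer typing is gone, which is the trade-off being debated in this thread.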
Thanks for checking, and for the quick
I know, but this is what we do with numpy masked arrays today. Ideally we'd convert to https://github.com/mdhaber/marray in the near future. Regardless, for now we'd like to enable any "extra" dtypes (categorical, intervals) while using the numpy dtypes as much as we can.
Interesting! @richard-berg I would be interested why not just limit this behavior to
+1 here given the above example of
I should say, @richard-berg, I've been thinking about this since you posted and I really appreciate your contribution and effort here btw! Looking forward to seeing your PR :)))) I love to see "users" get involved with open source, and wish it was easier given corporate situations sometimes. Thanks a million again!
What happened?
Given:
Now consider:
In xarray 2023.1.0 this gave a reasonable (if weakly-typed) result.
While upgrading to xarray 2025.3.x + pandas 2.x, my colleagues found it now raises:
What did you expect to happen?
Ideally, the result would be:
Minimal Complete Verifiable Example
MVCE confirmation
Anything else we need to know?
The difference is that arr.dtype is now pd.Int64Dtype() rather than np.dtype("object"), thanks to #8723. While arguably an improvement in typing, the xarray core doesn't seem ready to handle the former. In this case, core.dtypes.maybe_promote() is blindly passing a Pandas dtype to np.issubdtype, oops.
Patching this immediate issue is more revealing:
reindex then fails when duck_array_ops.where(condition, x, y) tries to coerce x & y to a common dtype. The new extension-array code in as_shared_dtype is not at all general: when y is a scalar (the fill_value from the reindex operation), it simply gives up.
Once I understood the cause of the reindex issue above, producing more -- and much more worrisome -- failures was trivial:
I'd venture to say that the pandas df.to_xarray() / srs.to_xarray() methods have become foot-guns, bordering on unusable, now that pandas 2.x has reimplemented all of its native datatypes on top of ExtensionArray / ExtensionDtype.
The good news is I have a fix. The bad news is it's pretty invasive, needing careful oversight from someone who actually knows what they're doing. (Before this week I'd never used xarray, nor looked at the numpy / pandas source code.)
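To make the np.issubdtype failure mentioned earlier concrete, here is a minimal sketch that needs only numpy and pandas. The helper is_numpy_subdtype is hypothetical, written for illustration; it is not xarray's maybe_promote() and not the fix referred to above.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype

pandas_dtype = pd.Int64Dtype()

# numpy cannot interpret a pandas extension dtype, so handing one
# straight to np.issubdtype raises TypeError.
try:
    np.issubdtype(pandas_dtype, np.integer)
    raised = False
except TypeError:
    raised = True
print("np.issubdtype raised TypeError:", raised)


def is_numpy_subdtype(dtype, kind) -> bool:
    """Hypothetical guard: only consult numpy for genuine numpy dtypes."""
    if isinstance(dtype, ExtensionDtype):
        return False
    return np.issubdtype(dtype, kind)


print(is_numpy_subdtype(np.dtype("int64"), np.integer))
print(is_numpy_subdtype(pandas_dtype, np.integer))
```

A guard like this only sidesteps the immediate exception; it says nothing about how the extension dtype should then be promoted, which is where the later failures come from.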
For now I might recommend excluding ALL numeric dtypes from being promoted to duck arrays, similar to what #9042 did for datetimes. (Basically everything except Categoricals, which seem to be the one extension type with good coverage in the xarray test suite, and which don't support the vast majority of ufuncs regardless.) That would at least allow people to safely continue using to_xarray() on modern versions of pandas, though you'd lose all the speed & type safety that @ilan-gold worked to achieve in 2024.5 & onward.
Environment
INSTALLED VERSIONS
commit: None
python: 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-372.32.1.el8_6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: None
xarray: 2025.3.1
pandas: 2.2.3
numpy: 1.26.4
scipy: 1.15.2
netCDF4: None
pydap: None
h5netcdf: 1.6.1
h5py: 3.9.0
zarr: 3.0.6
cftime: None
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2025.3.0
distributed: 2025.3.0
matplotlib: 3.10.1
cartopy: None
seaborn: 0.13.2
numbagg: 0.9.0
fsspec: 2024.9.0
cupy: 13.4.0
pint: None
sparse: 0.16.0
flox: None
numpy_groupies: None
setuptools: 78.1.0
pip: 25.0.1
conda: None
pytest: 8.3.5
mypy: 1.15.0
IPython: 8.35.0
sphinx: None