-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Numpy ufuncs e.g. np.[op](df1, df2) aligns columns in pandas 1.2.0 where it did not before #39184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@WhistleWhileYouWork thanks for the report. We "knowingly" changed this behaviour. At least, I commented about this consequence at #36955 (comment), but I didn't follow-up myself to ensure we did it with a deprecation warning instead of breaking change. We could still do that in 1.2.1, I suppose.
Note that there was a release candidate that could be tested (https://mail.python.org/pipermail/pandas-dev/2020-December/001311.html), and we actually do have nightly packages available to install (https://anaconda.org/scipy-wheels-nightly/pandas), but this might not be very well documented in our installation guide .. |
There won't be any penalty over the old behavior. pandas would have done the conversion anyway.
It is unfortunate that that slipped through. We're all too busy :/ If we're able to restore the old behavior and get the warning in soon, that'd be OK. But the longer we have the new / future behavior on 1.2.x, the more I think we should just keep it, since people will start relying on the new behavior (alignment). |
Understood, what is the standard accepted practice for accomplishing the previous behavior? Is it: df[['a', 'b']] + df[['c', 'd']].to_numpy() I see the previous comment states:
So that's good news since my main reason for using np ufunc syntax was to avoid any possible penalties. It would be nice to have the recommended replacement syntax along with the note in the
Ah, thanks. I only checked PyPl. I could not find any reference to the nightly conda packages in any of the usual locations. The Thanks all! |
Right. The warning and the release notes was recommend The warning should only occur when they aren't already aligned. |
in the release notes we did have
so have advertised the change |
The change was advertised but it is a breaking change, so the suggestion was to walk it back and start with a deprecation warning. It's possible that my use case was an edge case and there may not be enough users who were impacted. I can certainly live with it now that I know it was an intentional change. At the very least, I suggest guidance in the release notes on the recommended approach to avoiding alignment for those of us who were using NumPy ufuncs specifically to avoid alignment. Something like this:
|
we do allow breaking changes under certain circumstances. If we ascertain that the change meets those requirements we could maybe move the release note in 1.2.0 from the other enhancements section with a small section in the breaking API section (with requested guidance). We would then have a brief note in 1.2.1 release notes (other section) that the 1.2.0 release notes have been updated. I think the inconsistency between Series and DataFrame behaviors maybe enough to justify the breaking change. |
I will check how easy it is to put a deprecation in place. |
I made a PR -> #39239 I personally think we should have done it for 1.2.0 Of course now it is already out, one can question the need to still do it in retrospect. But it seems a possibly somewhat usecase to use numpy's method to avoid alignment in pandas, and then reverting to <1.2.0 behaviour + warning can still avoid breaking code for many users. But if we want to change it back again, we shouldn't wait with it too long (i.e. in v1.2.1) |
Yes. Consider how many people have yet to upgrade to 1.2.0 or greater (given that most users skip many releases before upgrading). I myself went from 1.0.3 to 1.2.0. Now consider the few that have already had to adapt to the change (like me). We are not impacted by a reversion to 1.1.5 behavior. I only discovered the issue because I happened to change the column names immediately after the operation, and the number of columns I was assigning didn't match the number of columns in the aligned output. For others who discover NaNs in their outputs further down the line, it may be much harder to trace back the cause of the issue. |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample
Problem description
This is a regression from pandas 1.1.5 (both versions are using numpy 1.19.5).
Normally if we want to add, subtract, multiply or divide df columns with different names we get NaNs because the column names don't match. E.g:
To get around this, we would use np.[op](df1, df2).
However, we get the same output as above.
Expected Output
Temporary solution
Output of
pd.show_versions()
pandas : 1.2.0
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 20.3.3
setuptools : 49.6.0.post20210108
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
Just my 2 cents: I was more than willing to test this on a nightly / master release, but it doesn't appear you release those. It could be quite beneficial to publish nightlies to PyPl so we don't report issues that have already been fixed. For some, it might be easier to test a nightly than peruse recent open and closed issues.
The text was updated successfully, but these errors were encountered: