Skip to content

BUG: Numpy ufuncs e.g. np.[op](df1, df2) aligns columns in pandas 1.2.0 where it did not before #39184

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
WhistleWhileYouWork opened this issue Jan 15, 2021 · 10 comments · Fixed by #39239
Closed
2 of 3 tasks
Labels
Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@WhistleWhileYouWork
Copy link

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

df = pd.DataFrame({k: [1,2,3,4,5] for k in 'abcd'})
np.add(df[['a', 'b']], df[['c', 'd']])

Problem description

This is a regression from pandas 1.1.5 (both versions are using numpy 1.19.5).

Normally if we want to add, subtract, multiply or divide df columns with different names we get NaNs because the column names don't match. E.g:

>>> df[['a', 'b']] + df[['c', 'd']]
    a   b   c   d
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN

To get around this, we would use np.[op](df1, df2).
However, we get the same output as above.

>>> np.add(df[['a', 'b']], df[['c', 'd']])
    a   b   c   d
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN

Expected Output

# Using pandas 1.1.5:
>>> np.add(df[['a', 'b']], df[['c', 'd']])
    a   b
0   2   2
1   4   4
2   6   6
3   8   8
4  10  10

Temporary solution

# This may have a potential copy penalty with the conversion to numpy
>>> df[['a', 'b']] + df[['c', 'd']].to_numpy()
    a   b
0   2   2
1   4   4
2   6   6
3   8   8
4  10  10

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : 3e89b4c python : 3.9.1.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19041 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_United States.1252

pandas : 1.2.0
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 20.3.3
setuptools : 49.6.0.post20210108
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Just my 2 cents: I was more than willing to test this on a nightly / master release, but it doesn't appear you release those. It could be quite beneficial to publish nightlies to PyPl so we don't report issues that have already been fixed. For some, it might be easier to test a nightly than peruse recent open and closed issues.

@WhistleWhileYouWork WhistleWhileYouWork added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 15, 2021
@WhistleWhileYouWork WhistleWhileYouWork changed the title BUG: BUG: Numpy np.[op](df1, df2) regression in pandas 1.2.0 Jan 15, 2021
@jorisvandenbossche
Copy link
Member

@WhistleWhileYouWork thanks for the report.

We "knowingly" changed this behaviour. At least, I commented about this consequence at #36955 (comment), but I didn't follow-up myself to ensure we did it with a deprecation warning instead of breaking change.

We could still do that in 1.2.1, I suppose.

cc @TomAugspurger

Just my 2 cents: I was more than willing to test this on a nightly / master release, but it doesn't appear you release those. It could be quite beneficial to publish nightlies to PyPl so we don't report issues that have already been fixed. For some, it might be easier to test a nightly than peruse recent open and closed issues.

Note that there was a release candidate that could be tested (https://mail.python.org/pipermail/pandas-dev/2020-December/001311.html), and we actually do have nightly packages available to install (https://anaconda.org/scipy-wheels-nightly/pandas), but this might not be very well documented in our installation guide ..

@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member Bug labels Jan 15, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.2.1 milestone Jan 15, 2021
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 15, 2021

This may have a potential copy penalty with the conversion to numpy

There won't be any penalty over the old behavior. pandas would have done the conversion anyway.

We could still do that in 1.2.1, I suppose.

It is unfortunate that that slipped through. We're all too busy :/

If we're able to restore the old behavior and get the warning in soon, that'd be OK. But the longer we have the new / future behavior on 1.2.x, the more I think we should just keep it, since people will start relying on the new behavior (alignment).

@WhistleWhileYouWork
Copy link
Author

WhistleWhileYouWork commented Jan 15, 2021

Understood, what is the standard accepted practice for accomplishing the previous behavior? Is it:

df[['a', 'b']] + df[['c', 'd']].to_numpy()

I see the previous comment states:

There won't be any penalty over the old behavior.

So that's good news since my main reason for using np ufunc syntax was to avoid any possible penalties.

It would be nice to have the recommended replacement syntax along with the note in the Release Notes about the change in addition to being included with any warnings in code.

we actually do have nightly packages available to install

Ah, thanks. I only checked PyPl. I could not find any reference to the nightly conda packages in any of the usual locations. The README.md would be the most logical place to mention it in my opinion.

Thanks all!

@TomAugspurger
Copy link
Contributor

It would be nice to have the recommended replacement syntax along with the note in the Release Notes about the change in addition to being included with any warnings in code.

Right. The warning and the release notes was recommend .to_numpy() on one of the inputs if you want to avoid alignment (the old / deprecated behavior).

The warning should only occur when they aren't already aligned.

@simonjayhawkins
Copy link
Member

We could still do that in 1.2.1, I suppose.

in the release notes we did have

Calling a binary-input NumPy ufunc on multiple DataFrame objects now aligns, matching the behavior of binary operations and ufuncs on Series (:issue:23743).

so have advertised the change

@WhistleWhileYouWork
Copy link
Author

WhistleWhileYouWork commented Jan 16, 2021

in the release notes we did have

Calling a binary-input NumPy ufunc on multiple DataFrame objects now aligns, matching the behavior of binary operations and ufuncs on Series (:issue:23743).

so have advertised the change

The change was advertised but it is a breaking change, so the suggestion was to walk it back and start with a deprecation warning.

It's possible that my use case was an edge case and there may not be enough users who were impacted. I can certainly live with it now that I know it was an intentional change. At the very least, I suggest guidance in the release notes on the recommended approach to avoiding alignment for those of us who were using NumPy ufuncs specifically to avoid alignment. Something like this:

  • Calling a binary-input NumPy ufunc on multiple DataFrame objects now aligns, matching the behavior of binary operations and ufuncs on Series (:issue:23743). To avoid alignment, convert one of the inputs to a NumPy array (e.g. df.to_numpy()).

@simonjayhawkins
Copy link
Member

The change was advertised but it is a breaking change, so the suggestion was to walk it back and start with a deprecation warning.

we do allow breaking changes under certain circumstances. If we ascertain that the change meets those requirements we could maybe move the release note in 1.2.0 from the other enhancements section with a small section in the breaking API section (with requested guidance). We would then have a brief note in 1.2.1 release notes (other section) that the 1.2.0 release notes have been updated.

I think the inconsistency between Series and DataFrame behaviors maybe enough to justify the breaking change.

@jorisvandenbossche
Copy link
Member

I will check how easy it is to put a deprecation in place.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 17, 2021

I made a PR -> #39239
What are people's thoughts on this?

I personally think we should have done it for 1.2.0 Of course now it is already out, one can question the need to still do it in retrospect. But it seems a possibly somewhat usecase to use numpy's method to avoid alignment in pandas, and then reverting to <1.2.0 behaviour + warning can still avoid breaking code for many users.

But if we want to change it back again, we shouldn't wait with it too long (i.e. in v1.2.1)

@WhistleWhileYouWork
Copy link
Author

WhistleWhileYouWork commented Jan 18, 2021

still avoid breaking code for many users

Yes. Consider how many people have yet to upgrade to 1.2.0 or greater (given that most users skip many releases before upgrading). I myself went from 1.0.3 to 1.2.0. Now consider the few that have already had to adapt to the change (like me). We are not impacted by a reversion to 1.1.5 behavior.

I only discovered the issue because I happened to change the column names immediately after the operation, and the number of columns I was assigning didn't match the number of columns in the aligned output. For others who discover NaNs in their outputs further down the line, it may be much harder to trace back the cause of the issue.

@WhistleWhileYouWork WhistleWhileYouWork changed the title BUG: Numpy np.[op](df1, df2) regression in pandas 1.2.0 BUG: Numpy ufuncs e.g. np.[op](df1, df2) aligns columns in pandas 1.2.0 where it did not before Jan 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Regression Functionality that used to work in a prior pandas version
Projects
None yet
4 participants