Skip to content

BUG:Wrong sum in groupby rolling due to precision issues #38752

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
AkariGale opened this issue Dec 28, 2020 · 11 comments
Closed
2 of 3 tasks

BUG:Wrong sum in groupby rolling due to precision issues #38752

AkariGale opened this issue Dec 28, 2020 · 11 comments
Labels
Bug Needs Info Clarification about behavior needed to assess issue Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding

Comments

@AkariGale
Copy link

AkariGale commented Dec 28, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import numpy as np
import math
import pandas as pd

id1 = pd.date_range('2019-10-01', '2020-01-08', freq='2H').to_frame(name='stamp').reset_index(drop=True)
id2 = pd.date_range('2019-10-01', '2020-01-08').to_frame(name='stamp').reset_index(drop=True)
id1['id'] = 1
a = 2.5
id1['value'] = (math.e + a) ** (((id1.index + 10) % 40) - 20)
id2['id'] = 2
id2['value'] = np.nan
id2.loc[id2.stamp == '2019-11-15', 'value'] = 10. ** -13

df = pd.concat([id1, id2], ignore_index=True)
df.sort_values('stamp', inplace=True)

roll_sum = df.groupby('id').rolling('93D', on='stamp')['value'].sum().reset_index()
print(roll_sum[roll_sum.id == 2].iloc[-1].value)

roll_sum_2 = df[df.id == 2].groupby('id').rolling('93D', on='stamp')['value'].sum().reset_index()
print(roll_sum_2[roll_sum_2.id == 2].iloc[-1].value)

Problem description

In the code above i show that last values in rollings are not equal:

  • if i count a sum of value using rolling on full dataframe, i obtain 0.015625000000100003
  • If i filter a dataframe and keep only id=2, i obtain 1e-13.

It seems that the aggregation in the second group depends on the first group.

If i replace a on 0, i will get 2.980242e-08 in a first rolling. And if i replace a on 1, i will get a correct answer 1e-13

Expected Output

1e-13

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 3e89b4c4b1580aa890023fc550774e63d499da25
python           : 3.7.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.3.13-300.fc31.x86_64
Version          : #1 SMP Mon Nov 25 17:25:25 UTC 2019
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.2.0
numpy            : 1.17.4
pytz             : 2019.2
dateutil         : 2.8.0
pip              : 19.1.1
setuptools       : 50.3.0
Cython           : None
pytest           : 5.4.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : None
pymysql          : None
psycopg2         : 2.8.4 (dt dec pq3 ext lo64)
jinja2           : 2.10.1
IPython          : None
pandas_datareader: None
bs4              : 4.9.0
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.2.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : 1.3.11
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
@AkariGale AkariGale added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 28, 2020
@MarcoGorelli
Copy link
Member

Hi @AkariGale ,

closing as you haven't written anything, but if you fill out the report and ping us we'll reopen

@AkariGale
Copy link
Author

Hi @MarcoGorelli !

Sorry, I accidentally posted an empty report. Сorrected

@MarcoGorelli MarcoGorelli reopened this Dec 28, 2020
@AkariGale AkariGale changed the title BUG: BUG:Wrong sum in rolling Dec 28, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.2.1 milestone Dec 29, 2020
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 29, 2020
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 29, 2020
@simonjayhawkins
Copy link
Member

first bad commit: [bad52a9] PERF: Use Indexers to implement groupby rolling (#34052) cc @mroeschke

@simonjayhawkins simonjayhawkins added Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 29, 2020
@mroeschke
Copy link
Member

Thanks for the report. You are correct in that there is influence from the first summation as groupby().rolling() now calculates the result "all at once" as opposed to independent groups for a performance reason.

Since the summation of values is carried across the groups, it appears this example hits numerical precision limits of our summation algorithm as replacing the values in your example with integers yields the correct result

In [13]:
    ...: id1 = pd.date_range('2019-10-01', '2020-01-08', freq='2H').to_frame(name='stamp').reset_index(drop=True)
    ...: id2 = pd.date_range('2019-10-01', '2020-01-08').to_frame(name='stamp').reset_index(drop=True)
    ...: id1['id'] = 1

In [14]: id1['value'] = range(len(id1))

In [15]: id2['id'] = 2

In [16]: id2['value'] = range(len(id2))

In [17]: df_simple = pd.concat([id1, id2], ignore_index=True)

In [18]: df_simple.sort_values('stamp', inplace=True)

In [19]: roll_sum_simple = df_simple.groupby('id').rolling('93D', on='stamp')['value'].sum().reset_index()

In [20]: roll_sum_simple[roll_sum_simple.id == 2].iloc[-1].value
Out[20]: 4929.0

In [21]: roll_sum_simple_2 = df_simple[df_simple.id == 2].groupby('id').rolling('93D', on='stamp')['value'].sum().reset_index()

In [22]: roll_sum_simple_2[roll_sum_simple_2.id ==2 ].iloc[-1].value
Out[22]: 4929.0

@mroeschke mroeschke changed the title BUG:Wrong sum in rolling BUG:Wrong sum in groupby rolling due to precision issues Dec 29, 2020
@phofl
Copy link
Member

phofl commented Jan 2, 2021

Could you provide a smaller example reproducing this? See https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@jreback jreback modified the milestones: 1.2.1, 1.2.2 Jan 16, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.2.2, Contributions Welcome Jan 21, 2021
@simonjayhawkins simonjayhawkins added the Needs Info Clarification about behavior needed to assess issue label Jan 21, 2021
@MarcoGorelli MarcoGorelli removed the Needs Info Clarification about behavior needed to assess issue label Feb 7, 2021
@MarcoGorelli
Copy link
Member

Hi @simonjayhawkins - any reason to move this off 1.2.2 if it's a regression, and to add "needs info"? I don't disagree, I just don't understand. A smaller reproducible example would be great of course, but isn't there enough info here to reproduce the issue?

@jreback
Copy link
Contributor

jreback commented Feb 7, 2021

@MarcoGorelli 1.2.2 release is today - so this issue is not being fixed rn

@MarcoGorelli
Copy link
Member

Yeah, makes sense, but this was moved off of 1.2.2 17 days ago, so I was just trying to understand the rationale behind that

@jreback
Copy link
Contributor

jreback commented Feb 7, 2021

oh it's not even clear this is actually fixable as it's a deliberate change

@simonjayhawkins
Copy link
Member

Hi @simonjayhawkins - any reason to move this off 1.2.2 if it's a regression, and to add "needs info"? I don't disagree, I just don't understand. A smaller reproducible example would be great of course, but isn't there enough info here to reproduce the issue?

removed milestone as was a regression in 1.1 release, not 1.2. #38752 (comment), #38721 (comment)

(of course if we have a regression fix for older releases, we would consider whether a backport is feasible)

'needs info' was used to indicate awaiting response from OP #38752 (comment)

@mroeschke mroeschke added the Needs Info Clarification about behavior needed to assess issue label Mar 26, 2021
@mroeschke
Copy link
Member

Thanks for the report, but as mentioned I suspect that is is an implementation artifact (numeric precision on the algorithm) that may be impossible to address. Closing, but if you could provide a more minimal example to pinpoint the issue, it would be helpful for the team to see if it could be addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Needs Info Clarification about behavior needed to assess issue Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding
Projects
None yet
Development

No branches or pull requests

7 participants