QST: best practices to combine groupby rolling and apply #55681

Closed
2 tasks done
randomgambit opened this issue Oct 25, 2023 · 3 comments
Labels
Apply (Apply, Aggregate, Transform, Map) · Usage Question · Window (rolling, ewma, expanding)

Comments

@randomgambit

Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/76969408/how-to-combine-groupby-rolling-and-apply-in-pandas

Question about pandas

Hello there,

Apologies if this turns out to be very simple, but I am not sure what the current best practice is (with the latest pandas version) for a groupby.rolling.apply() operation.

A typical use case would be applying pandas qcut() (or any function that, unlike rolling.max() for instance, has no native rolling() implementation) in a rolling fashion within each group of the dataframe.

Below is a concrete example where I use .apply() with a simple plus-one operation (it does not actually need the rolling part, but it keeps the example simple). Note that the rows within each group are already ordered by time (if you print each group while iterating over dd.groupby('group'), the observations are ordered by time).

import pandas as pd

dd = pd.DataFrame({'mynum' : [1,2,3,4,4,5,3],
                   'time' : [1,1,1,2,3,2,2],
                   'group': ['a', 'c','b', 'a','a' ,'b','c']})

Out[82]: 
   mynum  time group
0      1     1     a
1      2     1     c
2      3     1     b
3      4     2     a
4      4     3     a
5      5     2     b
6      3     2     c

# iloc[-1] is needed because the applied function must return a single number per window
dd.groupby('group', as_index = False).rolling(2).mynum.apply(lambda x: (x+1).iloc[-1])
Out[89]: 
group   
a      0    NaN
       3    5.0
       4    5.0
b      2    NaN
       5    6.0
c      1    NaN
       6    4.0
Name: mynum, dtype: float64

But assigning the result to a new column raises an error:

dd['var'] = dd.groupby('group', as_index = False).rolling(2).mynum.apply(lambda x: (x+1).iloc[-1])
TypeError: incompatible index of inserted column with frame index

Trying to be smarter and extracting the values (as suggested in the SO question) silently misaligns the result with the original rows:

dd['wrongvar'] = dd.groupby('group', as_index = False).rolling(2).mynum.apply(lambda x: (x+1).iloc[-1]).values

dd.sort_values(by = 'group')
Out[102]: 
   mynum  time group  wrongvar
0      1     1     a       NaN
3      4     2     a       NaN
4      4     3     a       6.0
2      3     1     b       5.0
5      5     2     b       NaN
1      2     1     c       5.0
6      3     2     c       4.0
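A minimal sketch of why the positional assignment goes wrong, assuming default groupby settings (groups sorted by key): the rolling result is stored group by group, so its rows are no longer in the frame's 0..6 order, and .values copies them in that grouped order. The names res, result_order and frame_order are introduced here for illustration only.

```python
import pandas as pd

dd = pd.DataFrame({'mynum': [1, 2, 3, 4, 4, 5, 3],
                   'time': [1, 1, 1, 2, 3, 2, 2],
                   'group': ['a', 'c', 'b', 'a', 'a', 'b', 'c']})

res = dd.groupby('group').rolling(2)['mynum'].apply(lambda x: (x + 1).iloc[-1])

# The result has two index levels: the group key and the original row label
n_levels = res.index.nlevels

# Original row labels in the order the result stores them (grouped by key)
result_order = res.index.get_level_values(1).tolist()

# Row labels in the order the frame stores them
frame_order = dd.index.tolist()

print(n_levels)
print(result_order)
print(frame_order)
```

Because the two orders differ, writing res.values into the frame pairs each value with whichever row happens to sit at the same position, which is consistent with the wrongvar column above.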

What am I supposed to do here?
Thank you so much for your help!

randomgambit added the Needs Triage (Issue that has not been reviewed by a pandas team member) and Usage Question labels on Oct 25, 2023
@rhshadrach
Member

import pandas as pd

dd = pd.DataFrame({'mynum' : [1,2,3,4,4,5,3],
                   'time' : [1,1,1,2,3,2,2],
                   'group': ['a', 'c','b', 'a','a' ,'b','c']})
dd['var'] = (
    dd.groupby('group', sort=False)
    .rolling(2)
    .mynum
    .apply(lambda x: (x+1).iloc[-1])
    .reset_index('group', drop=True)
)
print(dd)
#    mynum  time group  var
# 0      1     1     a  NaN
# 1      2     1     c  NaN
# 2      3     1     b  NaN
# 3      4     2     a  5.0
# 4      4     3     a  5.0
# 5      5     2     b  6.0
# 6      3     2     c  4.0

Is that the expected result?

@rhshadrach
Member

Somewhat related: #31007 (comment)

Though the OP is using .apply, they could be using .agg. This is another case where it seems best if we adhere to the semantics of a transform in groupby and not add the groups to the result's index.
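As a hedged illustration of what transform semantics look like today, the rolling computation can be run inside groupby(...).transform, which returns a result indexed like the original frame, so no group level has to be dropped afterwards. This is one possible reading of the comment above, not necessarily the API change it has in mind.

```python
import pandas as pd

dd = pd.DataFrame({'mynum': [1, 2, 3, 4, 4, 5, 3],
                   'time': [1, 1, 1, 2, 3, 2, 2],
                   'group': ['a', 'c', 'b', 'a', 'a', 'b', 'c']})

# transform calls the function once per group and stitches the pieces
# back together on the frame's original index
dd['var'] = dd.groupby('group')['mynum'].transform(
    lambda s: s.rolling(2).apply(lambda x: (x + 1).iloc[-1])
)
print(dd)
```

The trade-off is that the rolling window is built inside a Python-level function per group, which may be slower than groupby(...).rolling(...) on frames with many groups.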

@randomgambit
Author

randomgambit commented Oct 25, 2023

Very interesting, thanks @rhshadrach! I see now: reset_index() is the way to go, as it drops the outer index level of the groupby result (the grouping variable) while keeping the original index values (here, 0 to 6). Assignment then realigns correctly on that index, whereas with .values no alignment happens at all (and the ordering ends up being incorrect).
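A sketch of how the same pattern might carry over to the qcut() use case from the original question. The frame df, the column names and the helper last_tercile are hypothetical, and the values are chosen to be distinct within each group so that qcut's bin edges are unique. Real data with ties would likely need qcut's duplicates='drop' option or another tie-handling strategy.

```python
import pandas as pd

# Hypothetical data: group 'a' holds 1, 5, 3, 9 and group 'b' holds 2, 8, 4, 6
df = pd.DataFrame({
    'value': [1, 2, 5, 8, 3, 4, 9, 6],
    'group': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
})

def last_tercile(window: pd.Series) -> float:
    # Bin the window into terciles and return the bin code (0, 1 or 2)
    # of the most recent observation in the window
    codes = pd.qcut(window, 3, labels=False)
    return float(codes.iloc[-1])

df['tercile'] = (
    df.groupby('group')['value']
    .rolling(3)
    .apply(last_tercile)
    .reset_index(level=0, drop=True)  # drop the group level, keep original row labels
)
print(df)
```

The first two rows of each group stay NaN because rolling(3) only calls the function on full windows.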

rhshadrach added the Apply (Apply, Aggregate, Transform, Map) and Window (rolling, ewma, expanding) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Oct 26, 2023