BUG: groupby.apply respects as_index=False if and only if group_keys=True #57656

Open
3 tasks done
mvashishtha opened this issue Feb 28, 2024 · 3 comments
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

mvashishtha commented Feb 28, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(
    {'A': [7, -1, 4, 5], 'B': [10, 4, 2, 8]},
    index=pd.Index(['i3', 'i2', 'i1', 'i0'], name='i0'),
)

################################
# For transforms, like lambda x: x
################################

# when group_keys=True, apply() respects as_index=False. same is true when grouping by 'i0' or by ['i0', 'A']
print(df.groupby('A', as_index=True, group_keys=True).apply(lambda x: x, include_groups=False))
print(df.groupby('A', as_index=False, group_keys=True).apply(lambda x: x, include_groups=False))

# when group_keys=False, apply() does not respect as_index=False. same is true when grouping by 'i0' or by ['i0', 'A']
print(df.groupby('A', as_index=True, group_keys=False).apply(lambda x: x, include_groups=False))
print(df.groupby('A', as_index=False, group_keys=False).apply(lambda x: x, include_groups=False))

################################
# For non-transform lambda x: pd.DataFrame([x.iloc[0].sum()])
################################

# when group_keys=True, grouping by data column respects as_index=False.  same is true when grouping by 'i0' or by ['i0', 'A']
print(df.groupby('A', as_index=True, group_keys=True).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))
print(df.groupby('A', as_index=False, group_keys=True).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))

# when group_keys=False, grouping by data column does not respect as_index=False.  same is true when grouping by 'i0' or by ['i0', 'A']
print(df.groupby('A', as_index=True, group_keys=False).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))
print(df.groupby('A', as_index=False, group_keys=False).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))

Issue Description

groupby.apply respects as_index=False if and only if group_keys=True, but the documentation suggests that it should only respect as_index if group_keys=False.

My apologies in advance if I'm duplicating an issue or misunderstanding the intended behavior here. I know there has been some relevant discussion in #49543.

Expected Behavior

I don't know what the correct behavior is here. A simple and easily explainable behavior would be to always respect as_index=False. However, to be consistent with the documentation here, transform-like applies should never respect as_index=False, and I suppose that non-transform-like applies should respect it:

Since transformations do not include the groupings that are used to split the result, the arguments as_index and sort in DataFrame.groupby() and Series.groupby() have no effect.

When group_keys=True, the result does include the "groupings that are used to split the result", so by the reasoning in this note, as_index should have no effect. The current behavior is the opposite: as_index has an effect only when group_keys=True. (Despite the description of group_keys, it appears that apply includes the group keys in the index if and only if group_keys=True, regardless of whether func is a transform.)
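
The index structure that apply() returns for each combination can be inspected with a small sketch like the one below. The helper name `index_names` is hypothetical, and selecting `[['B']]` is used as a stand-in for `include_groups=False` so the sketch does not depend on that keyword being available; both are assumptions of this example, not part of the original report. The output for as_index=False with group_keys=True may differ between pandas versions, so it is only printed.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [7, -1, 4, 5], 'B': [10, 4, 2, 8]},
    index=pd.Index(['i3', 'i2', 'i1', 'i0'], name='i0'),
)


def index_names(as_index, group_keys):
    """Return the index level names of a transform-like apply (hypothetical helper)."""
    grouped = df.groupby('A', as_index=as_index, group_keys=group_keys)
    # Assumption: selecting [['B']] plays the role of include_groups=False here.
    result = grouped[['B']].apply(lambda x: x)
    return list(result.index.names)


# Print the index level names for all four parameter combinations.
for as_index in (True, False):
    for group_keys in (True, False):
        names = index_names(as_index, group_keys)
        print(f'as_index={as_index}, group_keys={group_keys}: {names}')
```

If 'A' shows up among the level names, the group keys were added to the index for that combination.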

Installed Versions

INSTALLED VERSIONS
------------------
commit                : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python                : 3.9.18.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 23.3.0
Version               : Darwin Kernel Version 23.3.0: Wed Dec 20 21:31:00 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6020
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.1
numpy                 : 1.26.3
pytz                  : 2023.3.post1
dateutil              : 2.8.2
setuptools            : 68.2.2
pip                   : 23.3.1
Cython                : None
pytest                : None
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : None
IPython               : 8.18.1
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : None
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2023.4
qtpy                  : None
pyqt5                 : None
@mvashishtha mvashishtha added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 28, 2024
@mvashishtha mvashishtha changed the title BUG: groupby.apply respects as_index=False for transforms when group_keys=True BUG: groupby.apply respects as_index=False if and only if group_keys=True Feb 28, 2024

mvashishtha commented Feb 28, 2024

The interaction of sort with group_keys is also confusing, but in a different way: for transforms, apply() sorts only if sort=True and group_keys=True; in particular, it does not sort if sort=True and group_keys=False. For non-transforms, apply() sorts if sort=True, regardless of the value of group_keys.

import pandas as pd

df = pd.DataFrame(
    {'A': [7, -1, 4, 5], 'B': [10, 4, 2, 8]},
    index=pd.Index(['i3', 'i2', 'i1', 'i0'], name='i0'),
)

################################
# For transforms, like lambda x: x
################################

# when group_keys=True, apply() sorts if and only if sort=True as well.
print(df.groupby('A', sort=True, group_keys=True).apply(lambda x: x, include_groups=False))
print(df.groupby('A', sort=False, group_keys=True).apply(lambda x: x, include_groups=False))

# when group_keys=False, never sort.
print(df.groupby('A', sort=True, group_keys=False).apply(lambda x: x, include_groups=False))
print(df.groupby('A', sort=False, group_keys=False).apply(lambda x: x, include_groups=False))

################################
# For non-transform lambda x: pd.DataFrame([x.iloc[0].sum()])
################################

# when group_keys=True, apply() respects sort=True and sort=False.
print(df.groupby('A', sort=True, group_keys=True).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))
print(df.groupby('A', sort=False, group_keys=True).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))

# when group_keys=False, apply() respects sort=True and sort=False.
print(df.groupby('A', sort=True, group_keys=False).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))
print(df.groupby('A', sort=False, group_keys=False).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))

edit: see below for comment about sort behavior for transforms under group_keys=True and group_keys=False.
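
The non-transform claim above (apply() sorts if sort=True, regardless of group_keys) can be checked with a sketch along these lines. The helper name `values_in_result_order` is hypothetical, and `[['B']]` is again used in place of `include_groups=False`; treat both as assumptions of this example.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [7, -1, 4, 5], 'B': [10, 4, 2, 8]},
    index=pd.Index(['i3', 'i2', 'i1', 'i0'], name='i0'),
)


def values_in_result_order(sort, group_keys):
    """Return the per-group values in the order apply() emits them (hypothetical helper)."""
    grouped = df.groupby('A', sort=sort, group_keys=group_keys)
    # Non-transform: each group collapses to a one-row, one-column frame (column 0).
    # Assumption: selecting [['B']] plays the role of include_groups=False here.
    result = grouped[['B']].apply(lambda x: pd.DataFrame([x.iloc[0].sum()]))
    return result[0].tolist()


for group_keys in (True, False):
    for sort in (True, False):
        values = values_in_result_order(sort, group_keys)
        print(f'sort={sort}, group_keys={group_keys}: {values}')
```

Since every group has a single row here, each emitted value is just that group's B value, which makes the group order easy to read off.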

@mvashishtha
Author

Correction for the sort behavior of transforms:

Rather than following the usual interpretation of either sort=True or sort=False, it seems that when group_keys=False, apply() reindexes the result back to the original dataframe order, so it returns a dataframe with exactly the same index as the original. On the other hand, when group_keys=True, sort=True really means sort=True, and sort=False really means sort=False. My examples above don't capture this because the keys are unique, but this one does:

import pandas as pd

df = pd.DataFrame(
    {'A': [7, -1, 4, 7], 'B': [10, 4, 2, 8]},
    index=pd.Index(['i3', 'i2', 'i1', 'i0'], name='i0'),
)

################################
# For transforms, like lambda x: x
################################

# when group_keys=True, sort means the usual thing: sort = True means sort by values of group keys. sort = False
# means sort by order of appearance of group keys.
print(df.groupby('A', sort=True, group_keys=True).apply(lambda x: x, include_groups=False))
print(df.groupby('A', sort=False, group_keys=True).apply(lambda x: x, include_groups=False))

# when group_keys=False, reindex result to the index of the original dataframe. sort param has no effect.
print(df.groupby('A', sort=True, group_keys=False).apply(lambda x: x, include_groups=False))
print(df.groupby('A', sort=False, group_keys=False).apply(lambda x: x, include_groups=False))
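
Instead of comparing printed frames by eye, the row order can be extracted directly. This is a minimal sketch: the helper name `row_order` is hypothetical, and `[['B']]` is used in place of `include_groups=False`, so both are assumptions rather than part of the original report.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [7, -1, 4, 7], 'B': [10, 4, 2, 8]},
    index=pd.Index(['i3', 'i2', 'i1', 'i0'], name='i0'),
)


def row_order(sort, group_keys):
    """Return the original row labels in the order apply() emits them (hypothetical helper)."""
    grouped = df.groupby('A', sort=sort, group_keys=group_keys)
    # Assumption: selecting [['B']] plays the role of include_groups=False here.
    result = grouped[['B']].apply(lambda x: x)
    # The level named 'i0' holds the original row labels in every case.
    return list(result.index.get_level_values('i0'))


for group_keys in (True, False):
    for sort in (True, False):
        print(f'sort={sort}, group_keys={group_keys}: {row_order(sort, group_keys)}')
```

With group_keys=False the labels come back in the original order for either value of sort, while with group_keys=True the order follows the group keys (sorted by value, or by first appearance).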

@rhshadrach
Member

@mvashishtha - would you be able to condense this back into the OP?
