BUG: groupby.apply respects as_index=False if and only if group_keys=True #57656

mvashishtha · 2024-02-28T01:41:57Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({'A': [7, -1, 4, 5], 'B': [10, 4, 2, 8]}, index=pd.Index(['i3', 'i2', 'i1', 'i0'], name='i0'))

################################
# For transforms, like lambda x: x
################################

# when group_keys=True, apply() respects as_index=False. same is true when grouping by 'i0' or by ['i0', 'A']
print(df.groupby('A', as_index=True, group_keys=True).apply(lambda x: x, include_groups=False))
print(df.groupby('A', as_index=False, group_keys=True).apply(lambda x: x, include_groups=False))

# when group_keys=False, apply() does not respect as_index=False. same is true when grouping by 'i0' or by ['i0', 'A']
print(df.groupby('A', as_index=True, group_keys=False).apply(lambda x: x, include_groups=False))
print(df.groupby('A', as_index=False, group_keys=False).apply(lambda x: x, include_groups=False))

################################
# For non-transform lambda x: pd.DataFrame([x.iloc[0].sum()])
################################

# when group_keys=True, grouping by data column respects as_index=False.  same is true when grouping by 'i0' or by ['i0', 'A']
print(df.groupby('A', as_index=True, group_keys=True).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))
print(df.groupby('A', as_index=False, group_keys=True).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))

# when group_keys=False, grouping by data column does not respect as_index=False.  same is true when grouping by 'i0' or by ['i0', 'A']
print(df.groupby('A', as_index=True, group_keys=False).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))
print(df.groupby('A', as_index=False, group_keys=False).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))

Issue Description

groupby.apply respects as_index=False if and only if group_keys=True, but the documentation suggests that it should only respect as_index if group_keys=False.

My apologies in advance if I'm duplicating an issue or misunderstanding the intended behavior here. I know there has been some relevant discussion in #49543.

Expected Behavior

I don't know what the correct behavior is here. A simple and easily explainable behavior would be to always respect as_index=False. However, to be consistent with the documentation here, transform-like applies should never respect as_index=False, and I suppose that non-transform-like applies should respect it:

Since transformations do not include the groupings that are used to split the result, the arguments as_index and sort in DataFrame.groupby() and Series.groupby() have no effect.

When group_keys=True, the result does include the "groupings that are used to split the result", so for the same reason that this note gives, as_index should have no effect. The current behavior is the opposite, though: as_index has an effect only when group_keys=True. (Despite the description of group_keys, it appears that apply includes the group keys in the index if and only if group_keys=True, regardless of whether func is a transform.)
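
For contrast, the documentation note quoted above can be checked against groupby.transform itself. The sketch below is an illustration only: the demean helper is hypothetical, and the duplicated key 7 is added here so that grouping actually matters. It compares the transform result across every combination of as_index and sort.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [7, -1, 4, 7], "B": [10, 4, 2, 8]},
    index=pd.Index(["i3", "i2", "i1", "i0"], name="i0"),
)


def demean(as_index, sort):
    """Hypothetical helper: subtract each group's mean of B via transform."""
    grouped = df.groupby("A", as_index=as_index, sort=sort)[["B"]]
    return grouped.transform(lambda s: s - s.mean())


baseline = demean(as_index=True, sort=True)

# Compare every flag combination against the baseline result.
all_equal = all(
    demean(as_index, sort).equals(baseline)
    for as_index in (True, False)
    for sort in (True, False)
)

print(baseline)
print("as_index and sort have no effect on transform:", all_equal)
```

The result keeps the original index and row order, which is the behavior the note describes for transformations; apply() with a transform-like function is where the results in this issue diverge.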

Installed Versions

INSTALLED VERSIONS
------------------
commit                : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python                : 3.9.18.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 23.3.0
Version               : Darwin Kernel Version 23.3.0: Wed Dec 20 21:31:00 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6020
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.1
numpy                 : 1.26.3
pytz                  : 2023.3.post1
dateutil              : 2.8.2
setuptools            : 68.2.2
pip                   : 23.3.1
Cython                : None
pytest                : None
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : None
IPython               : 8.18.1
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : None
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2023.4
qtpy                  : None
pyqt5                 : None

mvashishtha · 2024-02-28T17:08:46Z

The interaction of sort with group_keys is also confusing, but in a different way: ~~for transforms, we only sort if sort=True, group_keys=True, and in particular we do not sort if sort=True, group_keys=False.~~ For non-transforms, apply() sorts if sort=True, regardless of the value of group_keys.

import pandas as pd

df = pd.DataFrame({'A': [7, -1, 4, 5], 'B': [10, 4, 2, 8]}, index=pd.Index(['i3', 'i2', 'i1', 'i0'], name='i0'))

################################
# For transforms, like lambda x: x
################################

# when group_keys=True, apply() sorts if and only if sort=True as well.
print(df.groupby('A', sort=True, group_keys=True).apply(lambda x: x, include_groups=False))
print(df.groupby('A', sort=False, group_keys=True).apply(lambda x: x, include_groups=False))

# when group_keys=False, never sort.
print(df.groupby('A', sort=True, group_keys=False).apply(lambda x: x, include_groups=False))
print(df.groupby('A', sort=False, group_keys=False).apply(lambda x: x, include_groups=False))

################################
# For non-transform lambda x: pd.DataFrame([x.iloc[0].sum()])
################################

# when group_keys=True, apply() respects sort=True and sort=False.
print(df.groupby('A', sort=True, group_keys=True).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))
print(df.groupby('A', sort=False, group_keys=True).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))

# when group_keys=False, apply() respects sort=True and sort=False.
print(df.groupby('A', sort=True, group_keys=False).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))
print(df.groupby('A', sort=False, group_keys=False).apply(lambda x: pd.DataFrame([x.iloc[0].sum()]), include_groups=False))

edit: see below for comment about sort behavior for transforms under group_keys=True and group_keys=False.

mvashishtha · 2024-02-29T01:05:36Z

correction for sort behavior of transforms:

Rather than following the usual interpretation of either sort=True or sort=False, it seems that when group_keys=False, the result is reindexed back to the original dataframe order, so we return a dataframe with exactly the same index as the original. On the other hand, when group_keys=True, sort=True really means sort=True, and sort=False really means sort=False. My examples above don't capture this because the keys are unique, but this one does:

import pandas as pd

df = pd.DataFrame({'A': [7, -1, 4, 7], 'B': [10, 4, 2, 8]}, index=pd.Index(['i3', 'i2', 'i1', 'i0'], name='i0'))

################################
# For transforms, like lambda x: x
################################

# when group_keys=True, sort means the usual thing: sort = True means sort by values of group keys. sort = False
# means sort by order of appearance of group keys.
print(df.groupby('A', sort=True, group_keys=True).apply(lambda x: x, include_groups=False))
print(df.groupby('A', sort=False, group_keys=True).apply(lambda x: x, include_groups=False))

# when group_keys=False, reindex result to the index of the original dataframe. sort param has no effect.
print(df.groupby('A', sort=True, group_keys=False).apply(lambda x: x, include_groups=False))
print(df.groupby('A', sort=False, group_keys=False).apply(lambda x: x, include_groups=False))

rhshadrach · 2024-03-01T22:42:59Z

@mvashishtha - would you be able to condense this back into the OP?

mvashishtha added the Bug and Needs Triage labels Feb 28, 2024

mvashishtha changed the title ~~BUG: groupby.apply respects as_index=False for transforms when group_keys=True~~ BUG: groupby.apply respects as_index=False if and only if group_keys=True Feb 28, 2024
