Skip to content

[ArrayManager] BUG: fix setitem with non-aligned boolean dataframe #39539

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -157,3 +157,5 @@ jobs:
run: |
source activate pandas-dev
pytest pandas/tests/frame/methods --array-manager
pytest pandas/tests/frame/indexing/test_indexing.py::TestDataFrameIndexing::test_setitem_boolean --array-manager
pytest pandas/tests/frame/indexing/test_where.py --array-manager
4 changes: 4 additions & 0 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -621,6 +621,10 @@ def _as_manager(self, typ: str) -> DataFrame:
# fastpath of passing a manager doesn't check the option/manager class
return DataFrame(new_mgr)

@property
def _has_array_manager(self):
return isinstance(self._mgr, ArrayManager)

# ----------------------------------------------------------------------

@property
Expand Down
32 changes: 31 additions & 1 deletion pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -8801,6 +8801,13 @@ def _where(
if axis is not None:
axis = self._get_axis_number(axis)

cond_orig = cond

# Needed for DataFrames with ArrayManager, see below for details
all_bool_columns = False
if isinstance(cond, ABCDataFrame) and cond._has_array_manager:
all_bool_columns = all(is_bool_dtype(dt) for dt in cond.dtypes)

# align the cond to same shape as myself
cond = com.apply_if_callable(cond, self)
if isinstance(cond, NDFrame):
Expand All @@ -8812,9 +8819,32 @@ def _where(
raise ValueError("Array conditional must be same shape as self")
cond = self._constructor(cond, **self._construct_axes_dict())

# Needed for DataFrames with ArrayManager, see below for details
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to just do this inside the manager itself? I know this might mean a slight refactoring here, but it would be better if we could do that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically this code does two steps for cond: 1) align and 2) fillna.
The way that I wrote this now, those are a bit intertwined (I check something about the dataframe that is need for fillna before the alignment step), but I don't think we want to push the alignment inside the manager? (that's something that now is always done outside I think, the manager doesn't need to care about that)

As mentioned below, I could call cond._mgr.fillna(..) instead of cond.fillna(), and then add a keyword whether to force it to be boolean or not (what is now encoded in all_bool_columns).
But, in that idea, the code to determine all_bool_columns would still live here, and thus would only move part of the added complexity into the internals.

if (
isinstance(cond, ABCDataFrame)
and cond._has_array_manager
and isinstance(cond_orig, ABCSeries)
):
all_bool_columns = is_bool_dtype(cond_orig.dtype)

# make sure we are boolean
fill_value = bool(inplace)
cond = cond.fillna(fill_value)
try:
cond = cond.fillna(fill_value)
except TypeError:
# With ArrayManager, fillna can raise an error if `cond` is not
# of boolean dtype
raise ValueError("Boolean array expected for the condition")

# With ArrayManager, `fillna` does not automatically change object dtype
# back to bools (if the alignment made it object by introducing NaNs).
# So in this case we cast back to bool manually *if* the original columns
# before aligning were bool
# TODO this workaround can be removed once we have nullable boolean dtype
# as default
if isinstance(cond, ABCDataFrame) and cond._has_array_manager:
if not all(is_bool_dtype(dt) for dt in cond.dtypes) and all_bool_columns:
cond = cond.astype(bool)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a viable way to put these patches in with the ArrayManager code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't directly see one, as we don't want to change (IMO) the fillna behaviour.
Unless we we would add a keyword for this (only for internal use) to the manager fillna method, and call cond._mgr.fillna here instead of cond.fillna to be able to pass it through, but not sure that would be any cleaner (as long as it is only used once, at least, there might come up other places where this is needed later)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left same comment above. I think we are going to want to push more code to the manager (kind of the opposite that we have been doing of late though).

msg = "Boolean array expected for the condition, not {dtype}"

Expand Down
2 changes: 1 addition & 1 deletion pandas/core/internals/array_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -454,7 +454,7 @@ def is_mixed_type(self) -> bool:

@property
def is_numeric_mixed_type(self) -> bool:
return False
return all(is_numeric_dtype(t) for t in self.get_dtypes())

@property
def any_extension_types(self) -> bool:
Expand Down
3 changes: 2 additions & 1 deletion pandas/tests/frame/indexing/test_where.py
Original file line number Diff line number Diff line change
Expand Up @@ -159,7 +159,7 @@ def test_where_set(self, where_frame, float_string_frame):

def _check_set(df, cond, check_dtypes=True):
dfi = df.copy()
econd = cond.reindex_like(df).fillna(True)
econd = cond.reindex_like(df).fillna(True).astype(bool)
expected = dfi.mask(~econd)

return_value = dfi.where(cond, np.nan, inplace=True)
Expand Down Expand Up @@ -499,6 +499,7 @@ def test_where_axis(self):
assert return_value is None
tm.assert_frame_equal(result, expected)

def test_where_axis_multiple_dtypes(self):
# Multiple dtypes (=> multiple Blocks)
df = pd.concat(
[
Expand Down
11 changes: 11 additions & 0 deletions pandas/tests/frame/methods/test_fillna.py
Original file line number Diff line number Diff line change
Expand Up @@ -544,3 +544,14 @@ def test_fillna_nonconsolidated_frame():
df_nonconsol = df.pivot("i1", "i2")
result = df_nonconsol.fillna(0)
assert result.isna().sum().sum() == 0


def test_fillna_infer_bool_dtype(using_array_manager):
# With ArrayManager, fillna doesn't change/infer dtypes
df = DataFrame([[True, False], [np.nan, True], [False, None]], dtype=object)
result = df.fillna(True)
if using_array_manager:
expected = DataFrame([[True, False], [True, True], [False, True]], dtype=object)
else:
expected = DataFrame([[True, False], [True, True], [False, True]], dtype=bool)
tm.assert_frame_equal(result, expected)