Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Add fillna at the beginning of _where not to fill NA. #60729 #60772

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

sanggon6107
Copy link

Added fillna at the beginning of _where so that we can fill pd.NA.
Since this is my first PR, please correct me if I'm mistaken. Thanks!

@@ -9674,6 +9674,13 @@ def _where(
if axis is not None:
axis = self._get_axis_number(axis)

# We should not be filling NA. See GH#60729
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this trying to fill missing values when NaN is the missing value indicator? I don't think that is right either - the missing values should propogate for all types. We may just be missing coverage for the NaN case (which should be added to the test)

Copy link
Author

@sanggon6107 sanggon6107 Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback, @WillAyd .
I thought we could make the values propagate by filling cond with True, since _where() would finally keep the values in self alive where its cond is True.
Even if I don't fill those values here, _where would call fillna() using inplace at the below code. That's also why the result varies depending on whether inpalce=True or not.

pandas/pandas/core/generic.py

Lines 9695 to 9698 in e3b2de8

# make sure we are boolean
fill_value = bool(inplace)
cond = cond.fillna(fill_value)
cond = cond.infer_objects()

Could you explain in more detail what you mean by propagate for all type? Do you mean we need to keep NA as it is even after this line?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @WillAyd,

I've done some further investigations on this, but I still belive the current code is the simplest way to make the missing values propagate.
If we want to let NA propagate without calling fillna() here, there might be too many code changes needed. See below codes :

  1. Need to change the below code so that we don't fill the missing values when caller is where() or mask(). If we don't, fillna() will fill them with inplace.

pandas/pandas/core/generic.py

Lines 9695 to 9698 in f1441b2

# make sure we are boolean
fill_value = bool(inplace)
cond = cond.fillna(fill_value)
cond = cond.infer_objects()

  1. Need to change the below code as well since to_numpy() will fill the missing value using inplace when cond is a DataFrame.

pandas/pandas/core/generic.py

Lines 9703 to 9716 in f1441b2

if not isinstance(cond, ABCDataFrame):
# This is a single-dimensional object.
if not is_bool_dtype(cond):
raise TypeError(msg.format(dtype=cond.dtype))
else:
for _dt in cond.dtypes:
if not is_bool_dtype(_dt):
raise TypeError(msg.format(dtype=_dt))
if cond._mgr.any_extension_types:
# GH51574: avoid object ndarray conversion later on
cond = cond._constructor(
cond.to_numpy(dtype=bool, na_value=fill_value),
**cond._construct_axes_dict(),
)

  1. Since extract_bool_array() fills the missing values using arg na_value=False at EABackedBlock.where(), we might need to find every single NA index from cond before we call this function(using isna() for example) and then implement additional behaviour to make those values propagate at ExtensionArray._where().

def where(self, other, cond) -> list[Block]:
arr = self.values.T
cond = extract_bool_array(cond)

def extract_bool_array(mask: ArrayLike) -> npt.NDArray[np.bool_]:
"""
If we have a SparseArray or BooleanArray, convert it to ndarray[bool].
"""
if isinstance(mask, ExtensionArray):
# We could have BooleanArray, Sparse[bool], ...
# Except for BooleanArray, this is equivalent to just
# np.asarray(mask, dtype=bool)
mask = mask.to_numpy(dtype=bool, na_value=False)
mask = np.asarray(mask, dtype=bool)
return mask

If _where() is trying to fill the missing values for cond anyway, I think we don't necessarily have to disfavour the current code change. Could you give me some feedback?

Copy link
Member

@rhshadrach rhshadrach Mar 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd

Is this trying to fill missing values when NaN is the missing value indicator? I don't think that is right either - the missing values should propogate for all types.

By filling in the missing values on cond with True, the missing value in the caller propagates. It's not filling in this missing values on cond that then fails to properly propagate the caller's missing value.

@sanggon6107
Copy link
Author

FYI, it seems this has already been discussed at #53124 (comment)

@sanggon6107 sanggon6107 requested a review from WillAyd February 3, 2025 12:43
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks like the right approach, but a question on if we can simplify here.

Comment on lines 9678 to 9682
if isinstance(cond, np.ndarray):
cond = np.array(cond)
cond[np.isnan(cond)] = True
elif isinstance(cond, NDFrame):
cond = cond.fillna(True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also do fillna on L9704 below. Can these be combined?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what other types besides ndarray and NDFrame get here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what other types besides ndarray and NDFrame get here?

Hi @rhshadrach, thanks for the review!
I've tried to find if there are any other types that possibly can get here, but I couldn't find any.

According to the documentation, cond should be one of these : bool Series/DataFrame, array-like, or callable.
And array-like such as list/tuple would be converted to NDFrame/np.ndarray via below codes.

In case we input list/tuple to mask():
or In case we input callable(a function that returns list or tuple) to mask():
(pandas/core/generic.py > NDFrame.mask())

pandas/pandas/core/generic.py

Lines 10096 to 10098 in 57fd502

# see gh-21891
if not hasattr(cond, "__invert__"):
cond = np.array(cond)

In case we input list/tuple to where():
or In case we input scalar to 'mask()' or 'where():
or In case we input callable(a function that returns list or tuple) to where():
(pandas/core/generic.py > NDFrame._where())

pandas/pandas/core/generic.py

Lines 9712 to 9717 in 57fd502

else:
if not hasattr(cond, "shape"):
cond = np.asanyarray(cond)
if cond.shape != self.shape:
raise ValueError("Array conditional must be same shape as self")
cond = self._constructor(cond, **self._construct_axes_dict(), copy=False)

Please let me know if I'm missing anything.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rhshadrach, sorry for the confusion. I just realized that cond could be either list or tuple when we input a callable to where(). Will revise the code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach,
I've tried to combine this code with fillna(inplace) at L9732 as you said, but it seems this would result in some test failures since align() at L9722 sometimes returns an ndarray with full of np.nan, and then cond is supposed to be filled with inplace(=False) by L9732. And several tests is current expecting this behaviour as it is. For example, at tests.frame.indexing.test_mask.test_mask_stringdtype[Series] :

tests.frame.indexing.test_mask.test_mask_stringdtype[Series]

def test_mask_stringdtype(frame_or_series):
    # GH 40824
    obj = DataFrame(
        {"A": ["foo", "bar", "baz", NA]},
        index=["id1", "id2", "id3", "id4"],
        dtype=StringDtype(),
    )
    filtered_obj = DataFrame(
        {"A": ["this", "that"]}, index=["id2", "id3"], dtype=StringDtype()
    )
    expected = DataFrame(
        {"A": [NA, "this", "that", NA]},
        index=["id1", "id2", "id3", "id4"],
        dtype=StringDtype(),
    )
    if frame_or_series is Series:
        obj = obj["A"]
        filtered_obj = filtered_obj["A"]
        expected = expected["A"]

    filter_ser = Series([False, True, True, False])
    result = obj.mask(filter_ser, filtered_obj)

    tm.assert_equal(result, expected)
result
>>> 
id1     foo
id2     bar
id3     baz
id4    <NA>
Name: A, dtype: string

expected
>>>
id1    <NA>
id2    this
id3    that
id4    <NA>
Name: A, dtype: string

I suspect the behaviour of align() at L9722 is not desirable because current code will re-initialize the cond for these cases and fill cond with False regardless of cond given by users as below.

obj
>>>
id1     foo
id2     bar
id3     baz
id4    <NA>
Name: A, dtype: string
filtered_obj
>>>
id2    this
id3    that
Name: A, dtype: string

filter_ser = pd.Series([False, True, True, False]) 
filter_ser_2 = pd.Series([False, False, False, False])
filter_ser_3 = pd.Series([True, True, True, True])
result = obj.mask(filter_ser, filtered_obj) # Should return ["foo", "this", "that", pd.NA]. But this test is currently expecthing to be [pd.NA, "this", "that", pd.NA]
result
>>>
id1    <NA>
id2    this
id3    that
id4    <NA>
Name: A, dtype: string
result_2 = obj.mask(filter_ser_2, filtered_obj) # Should return ["foo", "bar", "baz", pd.NA]
result_2
>>>
id1    <NA>
id2    this
id3    that
id4    <NA>
Name: A, dtype: string
result_3 = obj.mask(filter_ser_3, filtered_obj) # Should reutrn ["pd.NA, "this", "that", pd.NA]
result_3
>>>
id1    <NA>
id2    this
id3    that
id4    <NA>
Name: A, dtype: string

I think I'd better open another issue regarding this, but for now, I suppose we'd best to leave fillna() as it is, not combining with the below one. Could you please let me know what you think about this?

failing tests

tests.frame.indexing.test_getitem.TestGetitemBooleanMask.test_getitem_boolean_series_with_duplicate_columns
tests.frame.indexing.test_indexing.TestDataFrameIndexing.test_setitem_cast
tests.frame.indexing.test_mask.test_mask_stringdtype[Series]
tests.frame.indexing.test_mask.test_mask_where_dtype_timedelta
tests.frame.indexing.test_where.test_where_bool_comparison
tests.frame.indexing.test_where.test_where_string_dtype[Series]
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_alignment[float_string]
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_alignment[mixed_int]
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_bug
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_invalid
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_ndframe_align
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_none
tests.indexing.multiindex.test_setitem.TestMultiIndexSetItem.test_frame_getitem_setitem_multislice
tests.indexing.test_indexing.TestMisc.test_no_reference_cycle
tests.series.indexing.test_mask.test_mask_casts
tests.series.indexing.test_where.test_where_error
tests.series.indexing.test_where.test_where_setitem_invalid


def test_mask_na():
# We should not be filling pd.NA. See GH#60729
series = Series([None, 1, 2, None, 3, 4, None], dtype=Int64Dtype())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a test for arrow. Can parametrize with e.g.

@pytest.mark.parametrize("dtype", ["Int64", "int64[pyarrow]")

Copy link
Author

@sanggon6107 sanggon6107 Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, just added a test for pyarrow.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

cond[isna(cond)] = True
elif isinstance(cond, NDFrame):
cond = cond.fillna(True)
elif isinstance(cond, (list, tuple)):
Copy link
Member

@rhshadrach rhshadrach Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs for where state that cond can be "list-like", so we should be using is_list_like instead of this condition. However, can you instead move this section so that it's combined with the if block on L9714 immediately below.

Copy link
Author

@sanggon6107 sanggon6107 Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've change the code, but it seems this causes test failures. I'll convert the status to draft and revise the code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just confirmed that all the tests passed. Thanks!

Comment on lines 4 to 6
from pandas import (
Series,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you revert this change.

Copy link
Author

@sanggon6107 sanggon6107 Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code reverted. Thanks!

Comment on lines 76 to 77
if dtype == "int64[pyarrow]":
pytest.importorskip("pyarrow")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, can you change the parametrization to be:

pytest.param("int64[pyarrow]", marks=td.skip_if_no("pyarrow"))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. just changed the code.

@sanggon6107 sanggon6107 marked this pull request as draft March 4, 2025 16:22
@sanggon6107 sanggon6107 marked this pull request as ready for review March 5, 2025 14:17
@sanggon6107
Copy link
Author

Many thanks for your feedback, @rhshadrach . All the changes you requested are now reflected. Could you review the changes?

@@ -9712,6 +9713,9 @@ def _where(
else:
if not hasattr(cond, "shape"):
cond = np.asanyarray(cond)
else:
cond = np.array(cond)
Copy link
Member

@mroeschke mroeschke Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cond = np.array(cond)
cond = extract_array(cond, extract_numpy=True)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid the copy here somehow?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might be able to call fillna(True) right below the self._constructor() since cond will become an NDFrame there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mroeschke ,confirmed that all the checks passed. Could you review the code change?

Co-authored-by: Matthew Roeschke <[email protected]>
@sanggon6107
Copy link
Author

Hi @mroeschke , Thanks for the suggestion! But it seems some tests fail when we changed the code np.array() to extract_array(cond, extract_numpy=True). I'll revert the code for now and see if there is still room for improvement.

@sanggon6107
Copy link
Author

Hi @mroeschke ,
I've found that where() raises when cond is a list which is subset of {False, True, np.nan}.
For example:

>>> import pandas as pd
>>> import numpy as np
>>> series = pd.Series([None,1,2,None,3,4,None])
>>> series.where([np.nan, False, False, np.nan, True, True, np.nan], -99)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\sangg\Desktop\pandas-sanggon-from-code\pandas\pandas\core\generic.py", line 10032, in where
    return self._where(cond, other, inplace=inplace, axis=axis, level=level)
  File "C:\Users\sangg\Desktop\pandas-sanggon-from-code\pandas\pandas\core\generic.py", line 9732, in _where
    raise TypeError(msg.format(dtype=cond.dtype))
TypeError: Boolean array expected for the condition, not object

I think where() raises because np.asanyarray() returns a numpy array with float64 type, and should work without an error since it does work when cond has pd.NA in it.
Do I need to open another issue or fix this here?
My suggestion is basically to specify dtype at L9715 as below.

    def _where(

...

        if isinstance(cond, NDFrame):

...

        else:
            if not hasattr(cond, "shape"):
                dt = None
                if set(cond).issubset({True, False, np.nan}):
                    dt = bool
                cond = np.asanyarray(cond, dtype=dt)
            if cond.shape != self.shape:
                raise ValueError("Array conditional must be same shape as self")

def test_mask_na(dtype):
# We should not be filling pd.NA. See GH#60729
series = Series([None, 1, 2, None, 3, 4, None], dtype=dtype)
result = series.mask(series <= 2, -99)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also test the case where the condition is an ndarray and a list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Series.mask incorrectly replaces positions of pd.NA in the cond argument
5 participants