-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Add fillna at the beginning of _where not to fill NA. #60729 #60772
base: main
Are you sure you want to change the base?
Conversation
pandas/core/generic.py
Outdated
@@ -9674,6 +9674,13 @@ def _where( | |||
if axis is not None: | |||
axis = self._get_axis_number(axis) | |||
|
|||
# We should not be filling NA. See GH#60729 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this trying to fill missing values when NaN is the missing value indicator? I don't think that is right either - the missing values should propogate for all types. We may just be missing coverage for the NaN case (which should be added to the test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, @WillAyd .
I thought we could make the values propagate by filling cond
with True
, since _where()
would finally keep the values in self
alive where its cond
is True
.
Even if I don't fill those values here, _where
would call fillna()
using inplace at the below code. That's also why the result varies depending on whether inpalce=True
or not.
Lines 9695 to 9698 in e3b2de8
# make sure we are boolean | |
fill_value = bool(inplace) | |
cond = cond.fillna(fill_value) | |
cond = cond.infer_objects() |
Could you explain in more detail what you mean by propagate for all type? Do you mean we need to keep NA as it is even after this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @WillAyd,
I've done some further investigations on this, but I still belive the current code is the simplest way to make the missing values propagate.
If we want to let NA propagate without calling fillna()
here, there might be too many code changes needed. See below codes :
- Need to change the below code so that we don't fill the missing values when caller is
where()
ormask()
. If we don't,fillna()
will fill them withinplace
.
Lines 9695 to 9698 in f1441b2
# make sure we are boolean | |
fill_value = bool(inplace) | |
cond = cond.fillna(fill_value) | |
cond = cond.infer_objects() |
- Need to change the below code as well since
to_numpy()
will fill the missing value usinginplace
when cond is a DataFrame.
Lines 9703 to 9716 in f1441b2
if not isinstance(cond, ABCDataFrame): | |
# This is a single-dimensional object. | |
if not is_bool_dtype(cond): | |
raise TypeError(msg.format(dtype=cond.dtype)) | |
else: | |
for _dt in cond.dtypes: | |
if not is_bool_dtype(_dt): | |
raise TypeError(msg.format(dtype=_dt)) | |
if cond._mgr.any_extension_types: | |
# GH51574: avoid object ndarray conversion later on | |
cond = cond._constructor( | |
cond.to_numpy(dtype=bool, na_value=fill_value), | |
**cond._construct_axes_dict(), | |
) |
- Since
extract_bool_array()
fills the missing values using argna_value=False
atEABackedBlock.where()
, we might need to find every single NA index from cond before we call this function(using isna() for example) and then implement additional behaviour to make those values propagate atExtensionArray._where()
.
pandas/pandas/core/internals/blocks.py
Lines 1664 to 1668 in f1441b2
def where(self, other, cond) -> list[Block]: | |
arr = self.values.T | |
cond = extract_bool_array(cond) | |
pandas/pandas/core/array_algos/putmask.py
Lines 116 to 127 in f1441b2
def extract_bool_array(mask: ArrayLike) -> npt.NDArray[np.bool_]: | |
""" | |
If we have a SparseArray or BooleanArray, convert it to ndarray[bool]. | |
""" | |
if isinstance(mask, ExtensionArray): | |
# We could have BooleanArray, Sparse[bool], ... | |
# Except for BooleanArray, this is equivalent to just | |
# np.asarray(mask, dtype=bool) | |
mask = mask.to_numpy(dtype=bool, na_value=False) | |
mask = np.asarray(mask, dtype=bool) | |
return mask |
If _where() is trying to fill the missing values for cond anyway, I think we don't necessarily have to disfavour the current code change. Could you give me some feedback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this trying to fill missing values when NaN is the missing value indicator? I don't think that is right either - the missing values should propogate for all types.
By filling in the missing values on cond
with True, the missing value in the caller propagates. It's not filling in this missing values on cond
that then fails to properly propagate the caller's missing value.
FYI, it seems this has already been discussed at #53124 (comment) |
Co-authored-by: Xiao Yuan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks like the right approach, but a question on if we can simplify here.
pandas/core/generic.py
Outdated
if isinstance(cond, np.ndarray): | ||
cond = np.array(cond) | ||
cond[np.isnan(cond)] = True | ||
elif isinstance(cond, NDFrame): | ||
cond = cond.fillna(True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also do fillna on L9704 below. Can these be combined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what other types besides ndarray and NDFrame get here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what other types besides ndarray and NDFrame get here?
Hi @rhshadrach, thanks for the review!
I've tried to find if there are any other types that possibly can get here, but I couldn't find any.
According to the documentation, cond should be one of these : bool Series
/DataFrame
, array-like
, or callable
.
And array-like
such as list
/tuple
would be converted to NDFrame
/np.ndarray
via below codes.
In case we input list
/tuple
to mask()
:
or In case we input callable
(a function that returns list
or tuple
) to mask()
:
(pandas/core/generic.py > NDFrame.mask())
Lines 10096 to 10098 in 57fd502
# see gh-21891 | |
if not hasattr(cond, "__invert__"): | |
cond = np.array(cond) |
In case we input list
/tuple
to where()
:
or In case we input scalar
to 'mask()' or 'where():
or In case we input callable
(a function that returns list
or tuple
) to where()
:
(pandas/core/generic.py > NDFrame._where())
Lines 9712 to 9717 in 57fd502
else: | |
if not hasattr(cond, "shape"): | |
cond = np.asanyarray(cond) | |
if cond.shape != self.shape: | |
raise ValueError("Array conditional must be same shape as self") | |
cond = self._constructor(cond, **self._construct_axes_dict(), copy=False) |
Please let me know if I'm missing anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rhshadrach, sorry for the confusion. I just realized that cond
could be either list or tuple when we input a callable to where()
. Will revise the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach,
I've tried to combine this code with fillna(inplace)
at L9732 as you said, but it seems this would result in some test failures since align()
at L9722 sometimes returns an ndarray
with full of np.nan
, and then cond
is supposed to be filled with inplace
(=False
) by L9732. And several tests is current expecting this behaviour as it is. For example, at tests.frame.indexing.test_mask.test_mask_stringdtype[Series]
:
tests.frame.indexing.test_mask.test_mask_stringdtype[Series]
def test_mask_stringdtype(frame_or_series):
# GH 40824
obj = DataFrame(
{"A": ["foo", "bar", "baz", NA]},
index=["id1", "id2", "id3", "id4"],
dtype=StringDtype(),
)
filtered_obj = DataFrame(
{"A": ["this", "that"]}, index=["id2", "id3"], dtype=StringDtype()
)
expected = DataFrame(
{"A": [NA, "this", "that", NA]},
index=["id1", "id2", "id3", "id4"],
dtype=StringDtype(),
)
if frame_or_series is Series:
obj = obj["A"]
filtered_obj = filtered_obj["A"]
expected = expected["A"]
filter_ser = Series([False, True, True, False])
result = obj.mask(filter_ser, filtered_obj)
tm.assert_equal(result, expected)
result
>>>
id1 foo
id2 bar
id3 baz
id4 <NA>
Name: A, dtype: string
expected
>>>
id1 <NA>
id2 this
id3 that
id4 <NA>
Name: A, dtype: string
I suspect the behaviour of align()
at L9722 is not desirable because current code will re-initialize the cond
for these cases and fill cond
with False
regardless of cond
given by users as below.
obj
>>>
id1 foo
id2 bar
id3 baz
id4 <NA>
Name: A, dtype: string
filtered_obj
>>>
id2 this
id3 that
Name: A, dtype: string
filter_ser = pd.Series([False, True, True, False])
filter_ser_2 = pd.Series([False, False, False, False])
filter_ser_3 = pd.Series([True, True, True, True])
result = obj.mask(filter_ser, filtered_obj) # Should return ["foo", "this", "that", pd.NA]. But this test is currently expecthing to be [pd.NA, "this", "that", pd.NA]
result
>>>
id1 <NA>
id2 this
id3 that
id4 <NA>
Name: A, dtype: string
result_2 = obj.mask(filter_ser_2, filtered_obj) # Should return ["foo", "bar", "baz", pd.NA]
result_2
>>>
id1 <NA>
id2 this
id3 that
id4 <NA>
Name: A, dtype: string
result_3 = obj.mask(filter_ser_3, filtered_obj) # Should reutrn ["pd.NA, "this", "that", pd.NA]
result_3
>>>
id1 <NA>
id2 this
id3 that
id4 <NA>
Name: A, dtype: string
I think I'd better open another issue regarding this, but for now, I suppose we'd best to leave fillna()
as it is, not combining with the below one. Could you please let me know what you think about this?
failing tests
tests.frame.indexing.test_getitem.TestGetitemBooleanMask.test_getitem_boolean_series_with_duplicate_columns
tests.frame.indexing.test_indexing.TestDataFrameIndexing.test_setitem_cast
tests.frame.indexing.test_mask.test_mask_stringdtype[Series]
tests.frame.indexing.test_mask.test_mask_where_dtype_timedelta
tests.frame.indexing.test_where.test_where_bool_comparison
tests.frame.indexing.test_where.test_where_string_dtype[Series]
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_alignment[float_string]
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_alignment[mixed_int]
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_bug
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_invalid
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_ndframe_align
tests.frame.indexing.test_where.TestDataFrameIndexingWhere.test_where_none
tests.indexing.multiindex.test_setitem.TestMultiIndexSetItem.test_frame_getitem_setitem_multislice
tests.indexing.test_indexing.TestMisc.test_no_reference_cycle
tests.series.indexing.test_mask.test_mask_casts
tests.series.indexing.test_where.test_where_error
tests.series.indexing.test_where.test_where_setitem_invalid
|
||
def test_mask_na(): | ||
# We should not be filling pd.NA. See GH#60729 | ||
series = Series([None, 1, 2, None, 3, 4, None], dtype=Int64Dtype()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a test for arrow. Can parametrize with e.g.
@pytest.mark.parametrize("dtype", ["Int64", "int64[pyarrow]")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, just added a test for pyarrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
pandas/core/generic.py
Outdated
cond[isna(cond)] = True | ||
elif isinstance(cond, NDFrame): | ||
cond = cond.fillna(True) | ||
elif isinstance(cond, (list, tuple)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs for where
state that cond
can be "list-like", so we should be using is_list_like
instead of this condition. However, can you instead move this section so that it's combined with the if
block on L9714 immediately below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've change the code, but it seems this causes test failures. I'll convert the status to draft and revise the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just confirmed that all the tests passed. Thanks!
from pandas import ( | ||
Series, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you revert this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code reverted. Thanks!
if dtype == "int64[pyarrow]": | ||
pytest.importorskip("pyarrow") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead, can you change the parametrization to be:
pytest.param("int64[pyarrow]", marks=td.skip_if_no("pyarrow"))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. just changed the code.
Many thanks for your feedback, @rhshadrach . All the changes you requested are now reflected. Could you review the changes? |
pandas/core/generic.py
Outdated
@@ -9712,6 +9713,9 @@ def _where( | |||
else: | |||
if not hasattr(cond, "shape"): | |||
cond = np.asanyarray(cond) | |||
else: | |||
cond = np.array(cond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cond = np.array(cond) | |
cond = extract_array(cond, extract_numpy=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we avoid the copy
here somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might be able to call fillna(True)
right below the self._constructor()
since cond will become an NDFrame there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mroeschke ,confirmed that all the checks passed. Could you review the code change?
Co-authored-by: Matthew Roeschke <[email protected]>
Hi @mroeschke , Thanks for the suggestion! But it seems some tests fail when we changed the code |
Hi @mroeschke , >>> import pandas as pd
>>> import numpy as np
>>> series = pd.Series([None,1,2,None,3,4,None])
>>> series.where([np.nan, False, False, np.nan, True, True, np.nan], -99)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\sangg\Desktop\pandas-sanggon-from-code\pandas\pandas\core\generic.py", line 10032, in where
return self._where(cond, other, inplace=inplace, axis=axis, level=level)
File "C:\Users\sangg\Desktop\pandas-sanggon-from-code\pandas\pandas\core\generic.py", line 9732, in _where
raise TypeError(msg.format(dtype=cond.dtype))
TypeError: Boolean array expected for the condition, not object I think def _where(
...
if isinstance(cond, NDFrame):
...
else:
if not hasattr(cond, "shape"):
dt = None
if set(cond).issubset({True, False, np.nan}):
dt = bool
cond = np.asanyarray(cond, dtype=dt)
if cond.shape != self.shape:
raise ValueError("Array conditional must be same shape as self") |
def test_mask_na(dtype): | ||
# We should not be filling pd.NA. See GH#60729 | ||
series = Series([None, 1, 2, None, 3, 4, None], dtype=dtype) | ||
result = series.mask(series <= 2, -99) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also test the case where the condition is an ndarray and a list.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Added fillna at the beginning of _where so that we can fill pd.NA.
Since this is my first PR, please correct me if I'm mistaken. Thanks!