Commit 840cb1f

Fix arrow groupby na (#60777)
* BUG: Fix factorize to ensure proper use of null_encoding parameter
* DOC: Add whatsnew entry for dictionary array NA handling fix
* BUG: Fix factorize to ensure proper use of null_encoding parameter and backwards compatibility maintained
* DOC: Improve rst file and test case comments for arrow groupby NA fix
1 parent d575eea commit 840cb1f

3 files changed, +19 -1 lines changed

doc/source/whatsnew/v3.0.0.rst

+1
@@ -790,6 +790,7 @@ ExtensionArray
 ^^^^^^^^^^^^^^
 - Bug in :class:`Categorical` when constructing with an :class:`Index` with :class:`ArrowDtype` (:issue:`60563`)
 - Bug in :meth:`.arrays.ArrowExtensionArray.__setitem__` which caused wrong behavior when using an integer array with repeated values as a key (:issue:`58530`)
+- Bug in :meth:`ArrowExtensionArray.factorize` where NA values were dropped when input was dictionary-encoded even when dropna was set to False(:issue:`60567`)
 - Bug in :meth:`api.types.is_datetime64_any_dtype` where a custom :class:`ExtensionDtype` would return ``False`` for array-likes (:issue:`57055`)
 - Bug in comparison between object with :class:`ArrowDtype` and incompatible-dtyped (e.g. string vs bool) incorrectly raising instead of returning all-``False`` (for ``==``) or all-``True`` (for ``!=``) (:issue:`59505`)
 - Bug in constructing pandas data structures when passing into ``dtype`` a string of the type followed by ``[pyarrow]`` while PyArrow is not installed would raise ``NameError`` rather than ``ImportError`` (:issue:`57928`)

pandas/core/arrays/arrow/array.py

+6 -1
@@ -1208,7 +1208,12 @@ def factorize(
             data = data.cast(pa.int64())
 
         if pa.types.is_dictionary(data.type):
-            encoded = data
+            if null_encoding == "encode":
+                # dictionary encode does nothing if an already encoded array is given
+                data = data.cast(data.type.value_type)
+                encoded = data.dictionary_encode(null_encoding=null_encoding)
+            else:
+                encoded = data
         else:
             encoded = data.dictionary_encode(null_encoding=null_encoding)
         if encoded.length() == 0:
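
The branch added above can be modelled without PyArrow. The following is a minimal pure-Python sketch, not pandas or PyArrow code: `dictionary_encode`, `decode`, and `factorize_encoded` are hypothetical helper names, and a dictionary array is represented as a plain `(indices, dictionary)` pair. It assumes that `"mask"` leaves nulls as null indices outside the dictionary while `"encode"` gives the null its own dictionary slot, which is the behaviour the new test appears to rely on.

```python
def dictionary_encode(values, null_encoding="mask"):
    """Toy dictionary encoding: return (indices, dictionary) for a list of values.

    "mask": a null stays a null index (None) and is not added to the dictionary.
    "encode": a null is treated as a regular value with its own dictionary slot.
    """
    if null_encoding not in ("mask", "encode"):
        raise ValueError(f"unknown null_encoding: {null_encoding!r}")
    dictionary = []   # unique values in order of first appearance
    positions = {}    # value -> position in `dictionary`
    indices = []
    for value in values:
        if value is None and null_encoding == "mask":
            indices.append(None)
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


def decode(indices, dictionary):
    """Toy equivalent of casting a dictionary array back to its value type."""
    return [None if i is None else dictionary[i] for i in indices]


def factorize_encoded(indices, dictionary, null_encoding):
    """Mirror the diff's branch for input that is already dictionary-encoded."""
    if null_encoding == "encode":
        # Re-encoding an already encoded array would change nothing,
        # so decode to plain values first and then encode the nulls.
        return dictionary_encode(decode(indices, dictionary), null_encoding="encode")
    # "mask": the existing encoding is reused unchanged (the old behaviour).
    return indices, dictionary


# An already encoded input, as in the GH#60567 test case: ["a1", NA]
indices, dictionary = dictionary_encode(["a1", None], null_encoding="mask")
masked = factorize_encoded(indices, dictionary, "mask")
encoded = factorize_encoded(indices, dictionary, "encode")
```

In this model the `"mask"` path keeps the null out of the dictionary, while the `"encode"` path yields a code for every element, matching the `[0, 1]` indices and `["a1", None]` uniques expected by the test in this commit.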

pandas/tests/extension/test_arrow.py

+12
@@ -3329,6 +3329,18 @@ def test_factorize_chunked_dictionary():
     tm.assert_index_equal(res_uniques, exp_uniques)
 
 
+def test_factorize_dictionary_with_na():
+    # GH#60567
+    arr = pd.array(
+        ["a1", pd.NA], dtype=ArrowDtype(pa.dictionary(pa.int32(), pa.utf8()))
+    )
+    indices, uniques = arr.factorize(use_na_sentinel=False)
+    expected_indices = np.array([0, 1], dtype=np.intp)
+    expected_uniques = pd.array(["a1", None], dtype=ArrowDtype(pa.string()))
+    tm.assert_numpy_array_equal(indices, expected_indices)
+    tm.assert_extension_array_equal(uniques, expected_uniques)
+
+
 def test_dictionary_astype_categorical():
     # GH#56672
     arrs = [
