Skip to content

ENH: Sorting of ExtensionArrays #19957

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Mar 22, 2018
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
47 changes: 47 additions & 0 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
import numpy as np

from pandas.errors import AbstractMethodError
from pandas.compat.numpy import function as nv

_not_implemented_message = "{} does not implement {}."

Expand Down Expand Up @@ -236,6 +237,52 @@ def isna(self):
"""
raise AbstractMethodError(self)

def _values_for_argsort(self):
# type: () -> ndarray
"""Get the ndarray to be passed to np.argsort.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed? shouldn't this one of our myriad of _values methods/properties here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_ndarray_valaues seems like more of an internal thing, no? I don't know enough to say whether the _ndarray_values appropriate for our current uses (mostly indexing IIRC) are also appropriate for argsort.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is what is the point of overriding this specific one? why is not a general purpose EA method/property used here. The proliferation of methods properties is really troublesome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if someone wants to override argsort great. but providing an indirect mechism is really kludgey.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is what is the point of overriding this specific one?

This PR is implementing argsort. I could see a similar pattern for factorize.

The proliferation of methods properties is really troublesome.

How so?

but providing an indirect mechism is really kludgey.

What's kudgey about it? I think the common case will be overriding the values provided to np.argsort, not overriding the sorting algorithm itself. This is true for Categorical and IPArray, and will be true for Period and probably others.

Without _values_for_argsort we, and 3rd party libraries, will have duplicate code for validating keyword arguments and the ascending kwarg.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't necessarily want to convert into extension arrays into the same NumPy array used for np.asarray().

For example, the IP Address extension array probably wants to convert into numpy object array of IPAddress object. But for sorting, it could just return numpy structured array with a few integer fields, which will be much faster for comparisons than Python IPAddress objects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. In the IPArray case, there's a numpy array, so _values_for_argsort is just that array.

OR simply call np.asarray() if you actually need an ndarray.

Then we're back in the duplicate code situation. That's OK for the base class, but Categorical, Period, and Interval will end up re-implementing argsort from scratch. With _values_for_argsort, it's just a matter of constructing that array (codes, ordinals, concat left & right).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the name _values_for_argsort, it's possible that we'll find other uses for it, in which case we just change the name.

Do you know if a simple array appropriate for arg-sorting is also appropriate for factorize, joins, indexing, etc? I'm not sure ahead of time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if a simple array appropriate for arg-sorting is also appropriate for factorize, joins, indexing, etc?

Well, we can scratch factorize / groupby off. Categorical defines _codes_for_groupby separately from its ndarray_values (codes)

def _codes_for_groupby(self, sort):

So we can't just use one array for everything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so then let's pick a name change _codes_for_groupby. in this refactor we want to find other usecases and fix our code now rather than later.

something like: _int_mapping_for_values


This is called from within 'ExtensionArray.argsort'.

Returns
-------
values : ndarray
"""
return np.array(self)

def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
"""Returns the indices that would sort this array.

Parameters
----------
ascending : bool, default True
Whether the indices should result in an ascending
or descending sort.
kind : {'quicksort', 'mergesort', 'heapsort'}, optional
Sorting algorithm.
args, kwargs:
passed through to :func:`numpy.argsort`.

Returns
-------
index_array : ndarray
Array of indices that sort ``self``.

See Also
--------
numpy.argsort
"""
# Implementor note: You have two places to override the behavior of
# argsort.
# 1. _values_for_argsort : construct the values passed to np.argsort
# 2. argsort : total control over sorting.

ascending = nv.validate_argsort_with_ascending(ascending, args, kwargs)
values = self._values_for_argsort()
result = np.argsort(values, kind=kind, **kwargs)
if not ascending:
result = result[::-1]
return result

# ------------------------------------------------------------------------
# Indexing methods
# ------------------------------------------------------------------------
Expand Down
53 changes: 38 additions & 15 deletions pandas/core/arrays/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -1390,17 +1390,24 @@ def check_for_ordered(self, op):
"you can use .as_ordered() to change the "
"Categorical to an ordered one\n".format(op=op))

def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
"""
Returns the indices that would sort the Categorical instance if
'sort_values' was called. This function is implemented to provide
compatibility with numpy ndarray objects.
def _values_for_argsort(self):
return self._codes.copy()

While an ordering is applied to the category values, arg-sorting
in this context refers more to organizing and grouping together
based on matching category values. Thus, this function can be
called on an unordered Categorical instance unlike the functions
'Categorical.min' and 'Categorical.max'.
def argsort(self, *args, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this changes our opinion on _values_for_argsort, but the apparently Python2 has issues with passing through the arguments correctly to the super() call.

____________________ TestCategoricalSort.test_numpy_argsort ____________________

self = <pandas.tests.categorical.test_sorting.TestCategoricalSort object at 0x7efcb391f950>

    def test_numpy_argsort(self):
        c = Categorical([5, 3, 1, 4, 2], ordered=True)
    
        expected = np.array([2, 4, 1, 3, 0])
>       tm.assert_numpy_array_equal(np.argsort(c), expected,
                                    check_dtype=False)

pandas/tests/categorical/test_sorting.py:26: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../miniconda3/envs/pandas/lib/python2.7/site-packages/numpy/core/fromnumeric.py:886: in argsort
    return argsort(axis, kind, order)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = [5, 3, 1, 4, 2]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
ascending = -1, kind = 'quicksort', args = (None,), kwargs = {}

    def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
        """
            Returns the indices that would sort the Categorical instance if
            'sort_values' was called. This function is implemented to provide
            compatibility with numpy ndarray objects.
    
            While an ordering is applied to the category values, arg-sorting
            in this context refers more to organizing and grouping together
            based on matching category values. Thus, this function can be
            called on an unordered Categorical instance unlike the functions
            'Categorical.min' and 'Categorical.max'.
    
            Returns
            -------
            argsorted : numpy array
    
            See also
            --------
            numpy.ndarray.argsort
            """
        # Keep the implementation here just for the docstring.
        return super(Categorical, self).argsort(ascending=ascending, kind=kind,
>                                               *args, **kwargs)
E       TypeError: argsort() got multiple values for keyword argument 'ascending'

Changing the Categorical.argsort to accept just *args, **kwargs fixes things, since ExtensionArray does the argument validation, but it's a bit unfortunate.

# TODO(PY2): use correct signature
# We have to do *args, **kwargs to avoid a a py2-only signature
# issue since np.argsort differs from argsort.
"""Return the indicies that would sort the Categorical.

Parameters
----------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do these doc-strings meet the new standards?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7bbe796 does, aside from examples which isn't really possible.

ascending : bool, default True
Whether the indices should result in an ascending
or descending sort.
kind : {'quicksort', 'mergesort', 'heapsort'}, optional
Sorting algorithm.
args, kwargs:
passed through to :func:`numpy.argsort`.

Returns
-------
Expand All @@ -1409,12 +1416,28 @@ def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
See also
--------
numpy.ndarray.argsort

Notes
-----
While an ordering is applied to the category values, arg-sorting
in this context refers more to organizing and grouping together
based on matching category values. Thus, this function can be
called on an unordered Categorical instance unlike the functions
'Categorical.min' and 'Categorical.max'.

Examples
--------
>>> pd.Categorical(['b', 'b', 'a', 'c']).argsort()
array([2, 0, 1, 3])

>>> cat = pd.Categorical(['b', 'b', 'a', 'c'],
... categories=['c', 'b', 'a'],
... ordered=True)
>>> cat.argsort()
array([3, 0, 1, 2])
"""
ascending = nv.validate_argsort_with_ascending(ascending, args, kwargs)
result = np.argsort(self._codes.copy(), kind=kind, **kwargs)
if not ascending:
result = result[::-1]
return result
# Keep the implementation here just for the docstring.
return super(Categorical, self).argsort(*args, **kwargs)

def sort_values(self, inplace=False, ascending=True, na_position='last'):
""" Sorts the Categorical by category value returning a new
Expand Down
40 changes: 40 additions & 0 deletions pandas/tests/extension/base/methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,3 +31,43 @@ def test_count(self, data_missing):
def test_apply_simple_series(self, data):
result = pd.Series(data).apply(id)
assert isinstance(result, pd.Series)

def test_argsort(self, data_for_sorting):
result = pd.Series(data_for_sorting).argsort()
expected = pd.Series(np.array([2, 0, 1], dtype=np.int64))
self.assert_series_equal(result, expected)

def test_argsort_missing(self, data_missing_for_sorting):
result = pd.Series(data_missing_for_sorting).argsort()
expected = pd.Series(np.array([1, -1, 0], dtype=np.int64))
self.assert_series_equal(result, expected)

@pytest.mark.parametrize('ascending', [True, False])
def test_sort_values(self, data_for_sorting, ascending):
ser = pd.Series(data_for_sorting)
result = ser.sort_values(ascending=ascending)
expected = ser.iloc[[2, 0, 1]]
if not ascending:
expected = expected[::-1]

self.assert_series_equal(result, expected)

@pytest.mark.parametrize('ascending', [True, False])
def test_sort_values_missing(self, data_missing_for_sorting, ascending):
ser = pd.Series(data_missing_for_sorting)
result = ser.sort_values(ascending=ascending)
if ascending:
expected = ser.iloc[[2, 0, 1]]
else:
expected = ser.iloc[[0, 2, 1]]
self.assert_series_equal(result, expected)

@pytest.mark.parametrize('ascending', [True, False])
def test_sort_values_frame(self, data_for_sorting, ascending):
df = pd.DataFrame({"A": [1, 2, 1],
"B": data_for_sorting})
result = df.sort_values(['A', 'B'])
expected = pd.DataFrame({"A": [1, 1, 2],
'B': data_for_sorting.take([2, 0, 1])},
index=[2, 0, 1])
self.assert_frame_equal(result, expected)
12 changes: 12 additions & 0 deletions pandas/tests/extension/category/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,18 @@ def data_missing():
return Categorical([np.nan, 'A'])


@pytest.fixture
def data_for_sorting():
return Categorical(['A', 'B', 'C'], categories=['C', 'A', 'B'],
ordered=True)


@pytest.fixture
def data_missing_for_sorting():
return Categorical(['A', None, 'B'], categories=['B', 'A'],
ordered=True)


@pytest.fixture
def na_value():
return np.nan
Expand Down
20 changes: 20 additions & 0 deletions pandas/tests/extension/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,26 @@ def all_data(request, data, data_missing):
return data_missing


@pytest.fixture
def data_for_sorting():
"""Length-3 array with a known sort order.

This should be three items [B, C, A] with
A < B < C
"""
raise NotImplementedError


@pytest.fixture
def data_missing_for_sorting():
"""Length-3 array with a known sort order.

This should be three items [B, NA, A] with
A < B and NA missing.
"""
raise NotImplementedError


@pytest.fixture
def na_cmp():
"""Binary operator for comparing NA values.
Expand Down
1 change: 1 addition & 0 deletions pandas/tests/extension/decimal/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,7 @@ def isna(self):
return np.array([x.is_nan() for x in self.values])

def take(self, indexer, allow_fill=True, fill_value=None):
indexer = np.asarray(indexer)
mask = indexer == -1

indexer = _ensure_platform_int(indexer)
Expand Down
43 changes: 35 additions & 8 deletions pandas/tests/extension/decimal/test_decimal.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,20 @@ def data_missing():
return DecimalArray([decimal.Decimal('NaN'), decimal.Decimal(1)])


@pytest.fixture
def data_for_sorting():
return DecimalArray([decimal.Decimal('1'),
decimal.Decimal('2'),
decimal.Decimal('0')])


@pytest.fixture
def data_missing_for_sorting():
return DecimalArray([decimal.Decimal('1'),
decimal.Decimal('NaN'),
decimal.Decimal('0')])


@pytest.fixture
def na_cmp():
return lambda x, y: x.is_nan() and y.is_nan()
Expand All @@ -35,19 +49,32 @@ def na_value():
return decimal.Decimal("NaN")


class TestDtype(base.BaseDtypeTests):
class BaseDecimal(object):
@staticmethod
def assert_series_equal(left, right, *args, **kwargs):

left_na = left.isna()
right_na = right.isna()

tm.assert_series_equal(left_na, right_na)
return tm.assert_series_equal(left[~left_na],
right[~right_na],
*args, **kwargs)


class TestDtype(BaseDecimal, base.BaseDtypeTests):
pass


class TestInterface(base.BaseInterfaceTests):
class TestInterface(BaseDecimal, base.BaseInterfaceTests):
pass


class TestConstructors(base.BaseConstructorsTests):
class TestConstructors(BaseDecimal, base.BaseConstructorsTests):
pass


class TestReshaping(base.BaseReshapingTests):
class TestReshaping(BaseDecimal, base.BaseReshapingTests):

def test_align(self, data, na_value):
# Have to override since assert_series_equal doesn't
Expand Down Expand Up @@ -88,15 +115,15 @@ def test_align_frame(self, data, na_value):
assert e2.loc[0, 'A'].is_nan()


class TestGetitem(base.BaseGetitemTests):
class TestGetitem(BaseDecimal, base.BaseGetitemTests):
pass


class TestMissing(base.BaseMissingTests):
class TestMissing(BaseDecimal, base.BaseMissingTests):
pass


class TestMethods(base.BaseMethodsTests):
class TestMethods(BaseDecimal, base.BaseMethodsTests):
@pytest.mark.parametrize('dropna', [True, False])
@pytest.mark.xfail(reason="value_counts not implemented yet.")
def test_value_counts(self, all_data, dropna):
Expand All @@ -112,7 +139,7 @@ def test_value_counts(self, all_data, dropna):
tm.assert_series_equal(result, expected)


class TestCasting(base.BaseCastingTests):
class TestCasting(BaseDecimal, base.BaseCastingTests):
pass


Expand Down
30 changes: 30 additions & 0 deletions pandas/tests/extension/json/test_json.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,16 @@ def data_missing():
return JSONArray([{}, {'a': 10}])


@pytest.fixture
def data_for_sorting():
return JSONArray([{'b': 1}, {'c': 4}, {'a': 2, 'c': 3}])


@pytest.fixture
def data_missing_for_sorting():
return JSONArray([{'b': 1}, {}, {'c': 4}])


@pytest.fixture
def na_value():
return {}
Expand Down Expand Up @@ -68,6 +78,26 @@ class TestMethods(base.BaseMethodsTests):
def test_value_counts(self, all_data, dropna):
pass

@pytest.mark.skip(reason="Dictionaries are not orderable.")
def test_argsort(self):
pass

@pytest.mark.skip(reason="Dictionaries are not orderable.")
def test_argsort_missing(self):
pass

@pytest.mark.skip(reason="Dictionaries are not orderable.")
def test_sort_values(self):
pass

@pytest.mark.skip(reason="Dictionaries are not orderable.")
def test_sort_values_missing(self):
pass

@pytest.mark.skip(reason="Dictionaries are not orderable.")
def test_sort_values_frame(self):
pass


class TestCasting(base.BaseCastingTests):
pass