ENH: Basis for a StringDtype using Arrow #35259

Merged
merged 91 commits into from
Nov 20, 2020
Changes from 5 commits
Commits
4c2e37a
Implement BaseDtypeTests for ArrowStringDtype
xhochy Jul 10, 2020
d477ee7
Implement getitem
xhochy Jul 13, 2020
206f493
Add basic copy implementation
xhochy Jul 13, 2020
d58dba6
Implement getitem for iterables
xhochy Jul 13, 2020
7a9e2c3
Remove commented code
xhochy Jul 13, 2020
ffc4c0f
Implement more Setitem/Getitem variants
xhochy Jul 13, 2020
c1305ab
Review comments by @jorisvandenbossche
xhochy Jul 13, 2020
13a42f7
Add Arrow issue numbers
xhochy Jul 13, 2020
decd022
Adapt to kernel renamings
xhochy Jul 15, 2020
3145e44
Handle take(indices<0, allow_fill=False)
xhochy Jul 15, 2020
e22b348
Handle fill_value better
xhochy Jul 15, 2020
4b8108c
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Oct 19, 2020
2446562
fix doctest
simonjayhawkins Oct 19, 2020
a0dcc85
Revert "fix doctest"
simonjayhawkins Oct 19, 2020
5c42173
change version for versionadded
simonjayhawkins Oct 19, 2020
28c3ef2
code checks
simonjayhawkins Oct 19, 2020
4044d4c
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Oct 21, 2020
1740524
skip tests for pyarrow<1.0
simonjayhawkins Oct 21, 2020
e9bb36f
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Oct 24, 2020
8ad120b
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 2, 2020
34bf57d
raise ImportError in constructors on pyarrow < 1.0.0. or not installed
simonjayhawkins Nov 2, 2020
f92241e
remove size, shape and ndim
simonjayhawkins Nov 2, 2020
c09382d
activate all extension array tests
simonjayhawkins Nov 2, 2020
bac64c1
string array tests
simonjayhawkins Nov 3, 2020
0956147
Update pandas/core/arrays/string_arrow.py
simonjayhawkins Nov 3, 2020
963e1cf
add a to_numpy() method and use from __array__
simonjayhawkins Nov 3, 2020
87b8e67
mypy fixup
simonjayhawkins Nov 3, 2020
1ed0585
remove workaround for ARROW-9407 and ci test on pyarrow=1.0.0
simonjayhawkins Nov 3, 2020
fa954f7
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 4, 2020
82b84bf
add _dtype class attribute
simonjayhawkins Nov 4, 2020
b1a3032
remove redundant integer indexing OOB and negative indexing checks in…
simonjayhawkins Nov 4, 2020
08d34f4
check pyarrow array is string type in constructor
simonjayhawkins Nov 4, 2020
ae49807
basic _from_factorized pending discussion on performant factorisation
simonjayhawkins Nov 4, 2020
2e5d4c7
update constructor error message and move test
simonjayhawkins Nov 4, 2020
c8318cc
add _concat_same_type classmethod
simonjayhawkins Nov 4, 2020
1a200a2
_as_pandas_scalar to method
simonjayhawkins Nov 4, 2020
e10be80
copy/paste fillna from fletcher as baseline (29 failed)
simonjayhawkins Nov 5, 2020
c1d3087
minor cleanup of fillna (29 failed)
simonjayhawkins Nov 5, 2020
34f563d
correct mistake in previous commit (25 failed)
simonjayhawkins Nov 5, 2020
f5fc4fd
add OpsMixin (23 failed)
simonjayhawkins Nov 5, 2020
a5a7c85
add binops (18 failed)
simonjayhawkins Nov 5, 2020
f651563
return Boolean array for comparison ops (12 failed)
simonjayhawkins Nov 5, 2020
f5419b9
fix ValueError: zero-size array to reduction operation maximum which …
simonjayhawkins Nov 5, 2020
3af5ce0
copy/paste value_counts from fletcher as baseline (5 failed)
simonjayhawkins Nov 5, 2020
bdf4ad2
tidy imports
simonjayhawkins Nov 5, 2020
e044c7f
fix test_take_non_na_fill_value (4 failed)
simonjayhawkins Nov 6, 2020
c5625a8
fix test_take_pandas_style_negative_raises (3 failed)
simonjayhawkins Nov 6, 2020
50889fb
parametrize string extension tests (3 failed)
simonjayhawkins Nov 6, 2020
0e1773b
xfail other 2 tests expecting views (1 failed)
simonjayhawkins Nov 6, 2020
7bb9574
add ensure_string_array to _from_sequence (1 failed)
simonjayhawkins Nov 6, 2020
fc45ef7
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 12, 2020
51d7d0a
Apply suggestions from code review
simonjayhawkins Nov 12, 2020
bd76a75
Merge branch 'arrow-string-array' of github.com:xhochy/pandas into ar…
simonjayhawkins Nov 12, 2020
3cf5c91
return NotImplemented in comparisons (7 failed)
simonjayhawkins Nov 12, 2020
07239a0
move arrow function lookup dict to module scope (7 failed)
simonjayhawkins Nov 12, 2020
9a7cfc5
remove isinstance(other, (ABCSeries, ABCDataFrame, ABCIndex)) check
simonjayhawkins Nov 12, 2020
2ba0dcd
remove na_value=cls._dtype.na_value from ensure_string_array call (7 …
simonjayhawkins Nov 13, 2020
97c56e2
colocate _from_sequence_of_strings with _from_sequence (7 failed)
simonjayhawkins Nov 13, 2020
d6d3543
revert change to extra_compile_args in setup.py
simonjayhawkins Nov 13, 2020
ab40dce
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 13, 2020
d71a895
sync fillna docstring with base
simonjayhawkins Nov 13, 2020
f342b62
Apply suggestions from code review
simonjayhawkins Nov 13, 2020
3d05c89
Merge branch 'arrow-string-array' of github.com:xhochy/pandas into ar…
simonjayhawkins Nov 13, 2020
b3c6347
other base.Base*Tests -> super()
simonjayhawkins Nov 13, 2020
26bca25
len(item) == 0 -> not len(item)
simonjayhawkins Nov 13, 2020
9579444
update copy docstring and return type
simonjayhawkins Nov 13, 2020
88094a7
test_constructor_not_string_type_raises with np.ndarray
simonjayhawkins Nov 13, 2020
ba0cee8
update test_from_sequence_no_mutate (7 failed)
simonjayhawkins Nov 13, 2020
6709ac3
change xfail message for base extension array tests (7 failed)
simonjayhawkins Nov 13, 2020
11388b4
change xfail reason message in test_value_counts_na
simonjayhawkins Nov 13, 2020
eb284e7
skip test_memory_usage for ArrowStringArray
simonjayhawkins Nov 13, 2020
27ce19a
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 14, 2020
9b70709
part implementation of na_value in to_numpy
simonjayhawkins Nov 14, 2020
6757feb
remove is_array_like in __getitem__
simonjayhawkins Nov 14, 2020
460ea38
Revert "remove is_array_like in __getitem__"
simonjayhawkins Nov 14, 2020
7bee5e2
remove just is_array_like in __getitem__
simonjayhawkins Nov 14, 2020
91f3763
Update pandas/core/arrays/string_arrow.py
simonjayhawkins Nov 14, 2020
36b662a
Apply suggestions from code review
simonjayhawkins Nov 14, 2020
7a9ef9c
lint fixup
simonjayhawkins Nov 14, 2020
5db8788
xfail test_astype_roundtrip
simonjayhawkins Nov 14, 2020
c76c39f
update expected in test_arrow_array
simonjayhawkins Nov 14, 2020
87b7863
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 15, 2020
24a782d
add fallback for scalar comparison ops
simonjayhawkins Nov 15, 2020
353bff9
dispatch to pyarrow for comparison with np.ndarray (1 failed)
simonjayhawkins Nov 15, 2020
be93947
fix test_reindex_non_na_fill_value
simonjayhawkins Nov 16, 2020
11eb08f
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 16, 2020
52440a7
use fill_mask in pa indices_array
simonjayhawkins Nov 16, 2020
bd05c2c
add comment to __getitem__
simonjayhawkins Nov 16, 2020
27c8de5
add comment on pyarrow compute
simonjayhawkins Nov 17, 2020
b6713e9
privatize `data`
simonjayhawkins Nov 17, 2020
125cb6f
Merge remote-tracking branch 'upstream/master' into arrow-string-array
simonjayhawkins Nov 17, 2020
6 changes: 5 additions & 1 deletion pandas/core/arrays/base.py
@@ -452,9 +452,13 @@ def astype(self, dtype, copy=True):
             NumPy ndarray with 'dtype' for its dtype.
         """
         from pandas.core.arrays.string_ import StringDtype
+        from pandas.core.arrays.string_arrow import ArrowStringDtype

         dtype = pandas_dtype(dtype)
-        if isinstance(dtype, StringDtype):  # allow conversion to StringArrays
+        # FIXME: Really hard-code here?
+        if isinstance(
+            dtype, (ArrowStringDtype, StringDtype)
+        ):  # allow conversion to StringArrays
             return dtype.construct_array_type()._from_sequence(self, copy=False)

         return np.array(self, dtype=dtype, copy=copy)
309 changes: 309 additions & 0 deletions pandas/core/arrays/string_arrow.py
@@ -0,0 +1,309 @@
from collections.abc import Iterable
from typing import Tuple, Type, Union

import numpy as np
import pyarrow as pa

from pandas._libs import missing as libmissing

from pandas.core.dtypes.base import ExtensionDtype
from pandas.core.dtypes.dtypes import register_extension_dtype

import pandas as pd
from pandas.api.types import is_array_like, is_bool_dtype, is_integer, is_integer_dtype
from pandas.core.arrays.base import ExtensionArray
from pandas.core.indexers import check_array_indexer


@register_extension_dtype
class ArrowStringDtype(ExtensionDtype):
    """
    Extension dtype for string data in a ``pyarrow.ChunkedArray``.

    .. versionadded:: 1.1.0

    .. warning::

       ArrowStringDtype is considered experimental. The implementation and
       parts of the API may change without warning.

    Attributes
    ----------
    None

    Methods
    -------
    None

    Examples
    --------
    >>> pd.ArrowStringDtype()
    ArrowStringDtype
    """

    name = "arrow_string"

    #: StringDtype.na_value uses pandas.NA
    na_value = libmissing.NA

    @property
    def type(self) -> Type[str]:
        return str

    @classmethod
    def construct_array_type(cls) -> Type["ArrowStringArray"]:
        """
        Return the array type associated with this dtype.

        Returns
        -------
        type
        """
        return ArrowStringArray

    def __hash__(self) -> int:
        return hash("ArrowStringDtype")

    def __repr__(self) -> str:
        return "ArrowStringDtype"

    def __from_arrow__(
        self, array: Union["pa.Array", "pa.ChunkedArray"]
    ) -> "ArrowStringArray":
        """
        Construct StringArray from pyarrow Array/ChunkedArray.
        """
        return ArrowStringArray(array)

    def __eq__(self, other) -> bool:
        """Check whether 'other' is equal to self.

        By default, 'other' is considered equal if
        * it's a string matching 'self.name'.
        * it's an instance of this type.

        Parameters
        ----------
        other : Any

        Returns
        -------
        bool
        """
        if isinstance(other, ArrowStringDtype):
            return True
        elif isinstance(other, str) and other == "arrow_string":
            return True
        else:
            return False


class ArrowStringArray(ExtensionArray):
    """
    Extension array for string data in a ``pyarrow.ChunkedArray``.

    .. versionadded:: 1.1.0

    .. warning::

       ArrowStringArray is considered experimental. The implementation and
       parts of the API may change without warning.

    Parameters
    ----------
    values : pyarrow.Array or pyarrow.ChunkedArray
        The array of data.

    Attributes
    ----------
    None

    Methods
    -------
    None

    See Also
    --------
    array
        The recommended function for creating a ArrowStringArray.
    Series.str
        The string methods are available on Series backed by
        a ArrowStringArray.

    Notes
    -----
    ArrowStringArray returns a BooleanArray for comparison methods.

    Examples
    --------
    >>> pd.array(['This is', 'some text', None, 'data.'], dtype="arrow_string")
    <ArrowStringArray>
    ['This is', 'some text', <NA>, 'data.']
    Length: 4, dtype: arrow_string
    """

    def __init__(self, values):
        if isinstance(values, pa.Array):
            self.data = pa.chunked_array([values])
        elif isinstance(values, pa.ChunkedArray):
            self.data = values
        else:
            raise ValueError(f"Unsupported type '{type(values)}' for ArrowStringArray")

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        # TODO(ARROW-9407): Accept pd.NA in Arrow
        scalars_corrected = [None if pd.isna(x) else x for x in scalars]
        return cls(pa.array(scalars_corrected, type=pa.string()))

    @property
    def dtype(self) -> ArrowStringDtype:
        """
        An instance of 'ArrowStringDtype'.
        """
        return ArrowStringDtype()

    def __array__(self, *args, **kwargs) -> "np.ndarray":
        """Correctly construct numpy arrays when passed to `np.asarray()`."""
        return self.data.__array__(*args, **kwargs)

    def __arrow_array__(self, type=None):
Contributor

Please add type annotations here, if you can.

        """Convert myself to a pyarrow Array or ChunkedArray."""
        return self.data

    @property
    def size(self) -> int:
        """
        Return the number of elements in this array.

        Returns
        -------
        size : int
        """
        return len(self.data)

    @property
    def shape(self) -> Tuple[int]:
        """Return the shape of the data."""
        # This may be patched by pandas to support pseudo-2D operations.
        return (len(self.data),)

    @property
    def ndim(self) -> int:
        """Return the number of dimensions of the underlying data."""
        return 1

    def __len__(self) -> int:
        """
        Length of this array.

        Returns
        -------
        length : int
        """
        return len(self.data)

    @classmethod
    def _from_sequence_of_strings(cls, strings, dtype=None, copy=False):
Contributor

Can you type the input args as much as possible?

        return cls._from_sequence(strings, dtype=dtype, copy=copy)

    def __getitem__(self, item):
        # type (Any) -> Any
        """Select a subset of self.

        Parameters
        ----------
        item : int, slice, or ndarray
            * int: The position in 'self' to get.
            * slice: A slice object, where 'start', 'stop', and 'step' are
              integers or None
            * ndarray: A 1-d boolean NumPy ndarray the same length as 'self'

        Returns
        -------
        item : scalar or ExtensionArray

        Notes
        -----
        For scalar ``item``, return a scalar value suitable for the array's
        type. This should be an instance of ``self.dtype.type``.
        For slice ``key``, return an instance of ``ExtensionArray``, even
        if the slice is length 0 or 1.
        For a boolean mask, return an instance of ``ExtensionArray``, filtered
        to the values where ``item`` is True.
        """
        item = check_array_indexer(self, item)

        if isinstance(item, Iterable):
            if not is_array_like(item):
                item = np.array(item)
            if len(item) == 0:
                return type(self)(pa.chunked_array([], type=pa.string()))
            elif is_integer_dtype(item):
                return self.take(item)
            elif is_bool_dtype(item):
                return type(self)(self.data.filter(item))
            else:
                raise IndexError(
                    "Only integers, slices and integer or "
                    "boolean arrays are valid indices."
                )
        elif is_integer(item):
            if item < 0:
                item += len(self)
            if item >= len(self):
                return None
Member

should this raise an error instead?
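
For comparison, built-in Python sequences raise IndexError for an out-of-range integer position instead of returning None. A minimal pure-Python sketch of that convention follows; `normalize_index` is a hypothetical helper used only for illustration and is not part of this PR.

```python
def normalize_index(item: int, length: int) -> int:
    """Map a possibly negative position onto [0, length) or raise IndexError."""
    original = item
    if item < 0:
        # Negative positions count from the end, as in list indexing.
        item += length
    if not 0 <= item < length:
        raise IndexError(f"index {original} is out of bounds for length {length}")
    return item


values = ["a", "b", "c"]
position = normalize_index(-1, len(values))
print(values[position])
```

Raising here would also cover positions that are still negative after the wrap-around, a case the `item >= len(self)` check above does not catch.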


        value = self.data[item]
        if isinstance(value, pa.ChunkedArray):
            return type(self)(value)
        else:
            return value.as_py()
Member

None needs to be replaced here with pd.NA, I think?


    def __setitem__(self, key, value):
        raise NotImplementedError("__setitem__")

    def fillna(self, value=None, method=None, limit=None):
        raise NotImplementedError("fillna")
Member

Starting with pyarrow 1.0, there is a pyarrow.compute.fill_null that does this.

Contributor Author

The problem with pc.fill_null is that it only supports scalar fill values, whereas pandas' fillna also accepts arrays as input and lets the caller limit the number of values to replace. fill_null supports neither, so in those cases we need to fall back to object-based methods.

Member

I have copied the fletcher implementation as a starting point.


    def _reduce(self, name, skipna=True, **kwargs):
        if name in ["min", "max"]:
            return getattr(self, name)(skipna=skipna)

        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")

    @property
    def nbytes(self) -> int:
        """
        The number of bytes needed to store this object in memory.
        """
        size = 0
        for chunk in self.data.chunks:
            for buf in chunk.buffers():
                if buf is not None:
                    size += buf.size
        return size
Member

ChunkedArray has an nbytes property nowadays, so I think this can be `return self.data.nbytes`.


    def isna(self) -> np.ndarray:
        """
        Boolean NumPy array indicating if each value is missing.

        This should return a 1-D array the same length as 'self'.
        """
        return self.data.is_null()
Member

This returns a pyarrow array, right? Probably want to convert it into a pandas BooleanArray (to use the nullable boolean dtype). BooleanDtype.__from_arrow__ implements a conversion (although I think that needs to be optimized; separate issue though)

Contributor Author

As this cannot be null, I will return a numpy array here. This is also what the current masked pandas arrays do.

Contributor

No comment on what's preferable, but the interface does allow for non-ndarrays here. SparseArray.isna() returns a Sparse[bool] I think.


    def copy(self):
        # type: () -> ExtensionArray
        """
        Return a copy of the array.

        Parameters
        ----------
        deep : bool, default False
            Also copy the underlying data backing this array.

        Returns
        -------
        ExtensionArray
        """
        return type(self)(self.data)