feat: add list aggregate methods #3332
7b15d52 to 8e04fc1
FBruzzesi
left a comment
Thanks @raisadz - this looks amazing! Just left a comment to possibly support more in the spark-like case 😇
FBruzzesi
left a comment
Thanks @raisadz, I have a couple more comments, apologies for the fragmented review 🙈
- Could you add a couple of test cases:
  a. All nulls in list
  b. Empty list
  c. polars.Expr.list.sum says: "If there are no non-null elements in a row, the output is 0." For the other aggregations it's unclear what the output should be, and I wonder how consistent it is across all the different backends.
- Could you vary the docstring examples a bit so they aren't all polars?
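To make the requested cases concrete, here is a minimal pure-Python sketch of the expected `list.sum` values, based only on the polars docstring quoted above. `expected_list_sum` is a hypothetical helper, not part of any backend, and treating a null list as null is an assumption; the other aggregations are left out because their expected output is the open question.

```python
from typing import Optional, Sequence, Union

Number = Union[int, float]


def expected_list_sum(row: Optional[Sequence[Optional[Number]]]) -> Optional[Number]:
    """Hypothetical reference value for `list.sum` on a single row."""
    if row is None:
        # Assumption: a null list stays null rather than becoming 0.
        return None
    # Nulls inside the list are skipped. With no non-null elements left
    # (all-null or empty list), sum() of nothing is 0, as the docstring says.
    return sum(v for v in row if v is not None)


cases = {
    "all nulls in list": [None, None],
    "empty list": [],
    "null list": None,
    "mixed": [1, None, 2],
}
for name, row in cases.items():
    print(f"{name}: {expected_list_sum(row)}")
```

A table like `cases` could be reused as parametrized test input and compared against each backend.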
def list_agg(
    array: ChunkedArrayAny,
    func: Literal["min", "max", "mean", "approximate_median", "sum"],
) -> ChunkedArrayAny:
    return (
        pa.Table.from_arrays(
            [pc.list_flatten(array), pc.list_parent_indices(array)],
            names=["values", "offsets"],
        )
        .group_by("offsets")
        .aggregate([("values", func)])
        .column(f"values_{func}")
    )
@raisadz I'm pretty excited by this! 😄
+1 from me on (#3332 (review))
I've just tried this out with the test case for list.unique:
data = {"a": [[2, 2, 3, None, None], None, [], [None]]}
The result for that should be:
[[None, 2, 3], None, [], [None]]
But using list_agg seems to have dropped 2/4 lists and all nulls 🤔
import pyarrow as pa
data = {"a": [[2, 2, 3, None, None], None, [], [None]]}
ca = pa.chunked_array([pa.array(data["a"])])
result = list_agg(ca, "distinct").to_pylist()
print(result)
[[2, 3], []]
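One way to see why rows go missing: flattening emits one value (and one parent index) per list element, so a null list and an empty list contribute nothing to the grouped table. The sketch below is a pure-Python model of that flatten-then-group step, not pyarrow itself; `flatten_with_parents` and `distinct_per_parent` are hypothetical names, and `skip_nulls` only stands in for the null-handling option of the aggregation.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Optional[List[Any]]


def flatten_with_parents(rows: List[Row]) -> Tuple[List[Any], List[int]]:
    """Model of flattening: one (value, parent index) pair per element."""
    values: List[Any] = []
    parents: List[int] = []
    for i, row in enumerate(rows):
        # A null list and an empty list both emit no elements at all.
        for v in row or []:
            values.append(v)
            parents.append(i)
    return values, parents


def distinct_per_parent(rows: List[Row], skip_nulls: bool = True) -> Dict[int, List[Any]]:
    """Group the flattened values by parent index, keeping distinct values."""
    groups: Dict[int, List[Any]] = {}
    for v, p in zip(*flatten_with_parents(rows)):
        bucket = groups.setdefault(p, [])
        if skip_nulls and v is None:
            continue
        if v not in bucket:
            bucket.append(v)
    return groups


rows = [[2, 2, 3, None, None], None, [], [None]]
print(flatten_with_parents(rows)[1])  # [0, 0, 0, 0, 0, 3]
print(list(distinct_per_parent(rows).values()))  # [[2, 3], []]
print(list(distinct_per_parent(rows, skip_nulls=False).values()))  # [[2, 3, None], [None]]
```

Parent indices 1 and 2 never appear, so no group exists for those rows regardless of how nulls inside the lists are treated.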
I managed to get slightly closer to what we want, by passing in options for the group_by:
from __future__ import annotations  # so the pa.ChunkedArray[Any] annotations are not evaluated at runtime

from typing import Any

import pyarrow as pa
import pyarrow.compute as pc


def list_agg_opts(
    array: pa.ChunkedArray[Any], func: Any, options: Any = None
) -> pa.ChunkedArray[Any]:
    return (
        pa.Table.from_arrays(
            [pc.list_flatten(array), pc.list_parent_indices(array)],
            names=["values", "offsets"],
        )
        .group_by("offsets")
        .aggregate([("values", func, options)])  # <-------
        .column(f"values_{func}")
    )

These are the correct results for 2/4 of the lists 🎉
But where did the other 2 go? 😳
result = list_agg_opts(ca, "distinct", pc.CountOptions("all")).to_pylist()
print(result)
[[2, 3, None], [None]]
Edit: I missed it myself lol, fixed in (d8363e1)
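For readers following along, one possible way to bring the missing rows back is to scatter the grouped results onto the original row positions. This is a pure-Python sketch under stated assumptions, not a description of what d8363e1 does: `scatter_back` is a hypothetical helper, and `grouped` is assumed to map each surviving parent index to its aggregated list.

```python
from typing import Any, Dict, List, Optional

Row = Optional[List[Any]]


def scatter_back(rows: List[Row], grouped: Dict[int, List[Any]]) -> List[Row]:
    """Rebuild a full-length result from per-parent-index group results."""
    out: List[Row] = []
    for i, row in enumerate(rows):
        if row is None:
            # A null list has no group; keep it null.
            out.append(None)
        else:
            # An empty list has no group either; fall back to an empty result.
            out.append(grouped.get(i, []))
    return out


rows = [[2, 2, 3, None, None], None, [], [None]]
grouped = {0: [2, 3, None], 3: [None]}  # the two groups printed above, keyed by parent index
print(scatter_back(rows, grouped))  # [[2, 3, None], None, [], [None]]
```

The element order inside each list still depends on the aggregation, so the first row here is `[2, 3, None]` rather than the `[None, 2, 3]` expected by the `list.unique` test.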
Description
The following list methods are implemented:
What type of PR is this? (check all applicable)
Related issues
Checklist