Contributor

@raisadz raisadz commented Nov 28, 2025

Description

The following list methods are implemented:

    - list.max
    - list.mean
    - list.median
    - list.min
    - list.sum

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

  • Related issue #<issue number>
  • Closes #<issue number>

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

@raisadz raisadz marked this pull request as ready for review November 28, 2025 12:29
Member

@FBruzzesi FBruzzesi left a comment

Thanks @raisadz - this looks amazing! Just left a comment to possibly support more in the spark-like case 😇

@FBruzzesi FBruzzesi added enhancement New feature or request nested data `list`, `struct`, etc labels Nov 28, 2025
Member

@FBruzzesi FBruzzesi left a comment

Thanks @raisadz, I have a couple more comments, apologies for the fragmented review 🙈

  1. Could you add a couple of test cases:
    a. All nulls in list
    b. Empty list
    c. The polars.Expr.list.sum docs say: "If there are no non-null elements in a row, the output is 0." For the other aggregations it's unclear what the output should be, and I wonder how consistent it is across the different backends.
  2. Could you vary the docstring examples a bit so they aren't all polars?
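
For the sum case, the quoted rule can be written down as a small pure-Python reference to compare each backend against. This is only a sketch: `reference_list_sum` is a hypothetical helper, not part of the PR, and keeping a null list as null (instead of turning it into 0) is an assumption that the tests would need to confirm.

```python
from typing import List, Optional, Sequence


def reference_list_sum(row: Optional[Sequence[Optional[int]]]) -> Optional[int]:
    """Hypothetical oracle for list.sum on a single row."""
    if row is None:
        # Assumption: a null list stays null rather than becoming 0.
        return None
    # Quoted rule: with no non-null elements in the row, the output is 0.
    return sum(value for value in row if value is not None)


# One regular row, a null list, an empty list, and an all-null list.
cases: List[Optional[List[Optional[int]]]] = [[2, 2, 3, None, None], None, [], [None]]
expected = [reference_list_sum(row) for row in cases]
print(expected)
```

The other aggregations are left out on purpose, since their expected output for these rows is the open question.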

Comment on lines +500 to +512
def list_agg(
    array: ChunkedArrayAny,
    func: Literal["min", "max", "mean", "approximate_median", "sum"],
) -> ChunkedArrayAny:
    return (
        pa.Table.from_arrays(
            [pc.list_flatten(array), pc.list_parent_indices(array)],
            names=["values", "offsets"],
        )
        .group_by("offsets")
        .aggregate([("values", func)])
        .column(f"values_{func}")
    )
Member

@dangotbanned dangotbanned Nov 28, 2025

@raisadz I'm pretty excited by this! 😄

+1 from me on (#3332 (review))


I've just tried this out with the test case for list.unique:

data = {"a": [[2, 2, 3, None, None], None, [], [None]]}

The result for that should be:

[[None, 2, 3], None, [], [None]]

But using list_agg seems to have dropped 2/4 lists and all nulls 🤔

import pyarrow as pa

data = {"a": [[2, 2, 3, None, None], None, [], [None]]}
ca = pa.chunked_array([pa.array(data["a"])])
result = list_agg(ca, "distinct").to_pylist()
print(result)
[[2, 3], []]
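
A possible explanation, modelled in pure Python rather than pyarrow: flattening emits no values for a null or an empty list, so those rows never receive a parent index and the group-by never creates a group for them. `flatten_with_parents` and `distinct_per_group` are hypothetical stand-ins for the `pc.list_flatten` / `pc.list_parent_indices` pair and the `distinct` aggregation. That the aggregation skips nulls by default is inferred from the output above, not checked against the pyarrow source.

```python
from typing import Dict, List, Optional, Tuple

Row = Optional[List[Optional[int]]]


def flatten_with_parents(rows: List[Row]) -> Tuple[List[Optional[int]], List[int]]:
    """Flatten list rows and record which row each value came from."""
    values: List[Optional[int]] = []
    parents: List[int] = []
    for index, row in enumerate(rows):
        # A null row and an empty row both contribute nothing here.
        for value in row or []:
            values.append(value)
            parents.append(index)
    return values, parents


def distinct_per_group(
    values: List[Optional[int]], parents: List[int]
) -> Dict[int, List[int]]:
    """Group values by parent index and keep distinct non-null values."""
    groups: Dict[int, List[int]] = {}
    for value, parent in zip(values, parents):
        bucket = groups.setdefault(parent, [])
        if value is not None and value not in bucket:
            bucket.append(value)
    return groups


rows: List[Row] = [[2, 2, 3, None, None], None, [], [None]]
values, parents = flatten_with_parents(rows)
groups = distinct_per_group(values, parents)
print(parents)                # only rows 0 and 3 ever appear
print(list(groups.values()))  # two groups, matching the two lists seen above
```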

I managed to get slightly closer to what we want, by passing in options for the group_by:

Show list_agg_opts

from __future__ import annotations  # keeps pa.ChunkedArray[Any] from being evaluated at runtime

from typing import Any

import pyarrow as pa
import pyarrow.compute as pc

def list_agg_opts(
    array: pa.ChunkedArray[Any], func: Any, options: Any = None
) -> pa.ChunkedArray[Any]:
    return (
        pa.Table.from_arrays(
            [pc.list_flatten(array), pc.list_parent_indices(array)],
            names=["values", "offsets"],
        )
        .group_by("offsets")
        .aggregate([("values", func, options)])  # <-------
        .column(f"values_{func}")
    )

These are the correct results for 2/4 of the lists 🎉

But where did the other 2 go? 😳

result = list_agg_opts(ca, "distinct", pc.CountOptions("all")).to_pylist()
print(result)
[[2, 3, None], [None]]

Edit: I missed it myself lol, fixed in (d8363e1)
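
The referenced commit isn't shown here, so the following is not a description of that fix. It is a minimal pure-Python sketch of one way to put the missing rows back: compute the per-group result only for rows that have elements, then walk the original rows and fill in `None` for null lists and `[]` for empty ones. `distinct_keep_nulls` and `list_unique_reference` are hypothetical names used only for illustration.

```python
from typing import List, Optional

Row = Optional[List[Optional[int]]]


def distinct_keep_nulls(row: List[Optional[int]]) -> List[Optional[int]]:
    """Order-preserving distinct that treats None as a regular value."""
    seen: List[Optional[int]] = []
    for value in row:
        if value not in seen:
            seen.append(value)
    return seen


def list_unique_reference(rows: List[Row]) -> List[Row]:
    # Only rows with at least one element would show up as groups.
    grouped = {i: distinct_keep_nulls(row) for i, row in enumerate(rows) if row}
    result: List[Row] = []
    for i, row in enumerate(rows):
        if row is None:
            result.append(None)  # a null list stays null
        else:
            result.append(grouped.get(i, []))  # an empty list has no group
    return result


data = {"a": [[2, 2, 3, None, None], None, [], [None]]}
restored = list_unique_reference(data["a"])
print(restored)
```

All four rows come back, with the same elements as the expected result quoted earlier; only the position of `None` within the first list differs.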


3 participants