docs: update user-defined-functions for 0.19.x #13071
Diff: python/user-guide/expressions/user-defined-functions.py (the example snippets file)
@@ -11,22 +11,25 @@
         "values": [10, 7, 1],
     }
 )
+print(df)
+# --8<-- [end:dataframe]
 
+# --8<-- [start:shift_map_batches]
 out = df.group_by("keys", maintain_order=True).agg(
-    pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map"),
+    pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map_batches"),
     pl.col("values").shift().alias("shift_expression"),
 )
-print(df)
-# --8<-- [end:dataframe]
+print(out)
+# --8<-- [end:shift_map_batches]
Comment on lines +14 to +23

> currently, this snippet creates both the `DataFrame` and the aggregated `out` in one block. I'm suggesting to, instead, split the snippet into two: one that only builds the `DataFrame`, and one that runs the `map_batches` aggregation.

> Sounds good.
 
-# --8<-- [start:apply]
+# --8<-- [start:map_elements]
 out = df.group_by("keys", maintain_order=True).agg(
-    pl.col("values").map_elements(lambda s: s.shift()).alias("shift_map"),
+    pl.col("values").map_elements(lambda s: s.shift()).alias("shift_map_elements"),
     pl.col("values").shift().alias("shift_expression"),
 )
 print(out)
-# --8<-- [end:apply]
+# --8<-- [end:map_elements]
 
 # --8<-- [start:counter]
 counter = 0
@@ -39,7 +42,7 @@ def add_counter(val: int) -> int:
 
 
 out = df.select(
-    pl.col("values").map_elements(add_counter).alias("solution_apply"),
+    pl.col("values").map_elements(add_counter).alias("solution_map_elements"),
     (pl.col("values") + pl.int_range(1, pl.count() + 1)).alias("solution_expr"),
 )
 print(out)
@@ -49,7 +52,7 @@ def add_counter(val: int) -> int:
 out = df.select(
     pl.struct(["keys", "values"])
     .map_elements(lambda x: len(x["keys"]) + x["values"])
-    .alias("solution_apply"),
+    .alias("solution_map_elements"),
     (pl.col("keys").str.len_bytes() + pl.col("values")).alias("solution_expr"),
 )
 print(out)
Diff: user-defined-functions.md (the user-guide page for this section)

@@ -1,9 +1,5 @@
 # User-defined functions (Python)
 
-!!! warning "Not updated for Python Polars `0.19.0`"
-
-    This section of the user guide still needs to be updated for the latest Polars release.
-
 You should be convinced by now that Polars expressions are so powerful and flexible that there is much less need for custom Python functions
 than in other libraries.
@@ -12,28 +8,28 @@ over data in Polars.
 
 For this we provide the following expressions:
 
-- `map`
-- `apply`
+- `map_batches`
+- `map_elements`
 
-## To `map` or to `apply`.
+## To `map_batches` or to `map_elements`.
 
 These functions have an important distinction in how they operate and consequently what data they will pass to the user.
 
-A `map` passes the `Series` backed by the `expression` as is.
+A `map_batches` passes the `Series` backed by the `expression` as is.
 
-`map` follows the same rules in both the `select` and the `group_by` context, this will
+`map_batches` follows the same rules in both the `select` and the `group_by` context, this will
 mean that the `Series` represents a column in a `DataFrame`. Note that in the `group_by` context, that column is not yet
 aggregated!
 
-Use cases for `map` are for instance passing the `Series` in an expression to a third party library. Below we show how
-we could use `map` to pass an expression column to a neural network model.
+Use cases for `map_batches` are for instance passing the `Series` in an expression to a third party library. Below we show how
+we could use `map_batches` to pass an expression column to a neural network model.
 
 === ":fontawesome-brands-python: Python"
-    [:material-api: `map`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.map.html)
+    [:material-api: `map_batches`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.map_batches.html)
 
     ```python
     df.with_columns([
-        pl.col("features").map(lambda s: MyNeuralNetwork.forward(s.to_numpy())).alias("activations")
+        pl.col("features").map_batches(lambda s: MyNeuralNetwork.forward(s.to_numpy())).alias("activations")
     ])
     ```
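
As an aside that is not part of this diff: a minimal runnable sketch of the same `map_batches` pattern, with NumPy standing in for the third-party model (the `features` data and the `log1p` transform are placeholder assumptions):

```python
import numpy as np
import polars as pl

df = pl.DataFrame({"features": [1.0, 2.0, 3.0]})

# map_batches hands the whole Series to the callable in one go, so we can
# round-trip through NumPy (or any third-party library) and return a Series.
out = df.with_columns(
    pl.col("features")
    .map_batches(lambda s: pl.Series(np.log1p(s.to_numpy())))
    .alias("activations")
)
print(out)
```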
@@ -45,9 +41,9 @@ df.with_columns([
     ])
     ```
 
-Use cases for `map` in the `group_by` context are slim. They are only used for performance reasons, but can quite easily lead to incorrect results. Let me explain why.
+Use cases for `map_batches` in the `group_by` context are slim. They are only used for performance reasons, but can quite easily lead to incorrect results. Let me explain why.

Review comments:

> to be honest I don't really understand this phrase to begin with - what are the performance reasons to use `map_batches` in the `group_by` context?

> You could do an elementwise operation with `map_batches`. E.g.
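
For instance (an illustrative sketch, not the example from the thread): an elementwise function is order-independent, so pushing it through `map_batches` on the not-yet-aggregated column still yields the per-group result you would expect:

```python
import polars as pl

df = pl.DataFrame({"keys": ["a", "a", "b"], "values": [10, 7, 1]})

# Doubling is elementwise, so applying it to the whole column before
# aggregation (what map_batches does in group_by) matches the expression API.
out = df.group_by("keys", maintain_order=True).agg(
    pl.col("values").map_batches(lambda s: s * 2).alias("doubled_map_batches"),
    (pl.col("values") * 2).alias("doubled_expression"),
)
print(out)
```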
 
-{{code_block('user-guide/expressions/user-defined-functions','dataframe',['map'])}}
+{{code_block('user-guide/expressions/user-defined-functions','dataframe',[])}}

Comment on lines -50 to +46

> now this snippet just creates a dataframe, so I've removed the `['map']` API reference from the `code_block`.

 
 ```python exec="on" result="text" session="user-guide/udf"
 --8<-- "python/user-guide/expressions/user-defined-functions.py:setup"

@@ -68,75 +64,65 @@ If we would then apply a `shift` operation to the right, we'd expect:
 "b" -> [null]
 ```
 
-Now, let's print and see what we've got.
+Let's try that out and see what we get:
 
-```python
-print(out)
-```
+{{code_block('user-guide/expressions/user-defined-functions','shift_map_batches',['map_batches'])}}
 
-```
-shape: (2, 3)
-┌──────┬────────────┬──────────────────┐
-│ keys ┆ shift_map  ┆ shift_expression │
-│ ---  ┆ ---        ┆ ---              │
-│ str  ┆ list[i64]  ┆ list[i64]        │
-╞══════╪════════════╪══════════════════╡
-│ a    ┆ [null, 10] ┆ [null, 10]       │
-│ b    ┆ [7]        ┆ [null]           │
-└──────┴────────────┴──────────────────┘
+```python exec="on" result="text" session="user-guide/udf"
+--8<-- "python/user-guide/expressions/user-defined-functions.py:shift_map_batches"
 ```
 
 Ouch.. we clearly get the wrong results here. Group `"b"` even got a value from group `"a"` 😵.
 
-This went horribly wrong, because the `map` applies the function before we aggregate! So that means the whole column `[10, 7, 1]` got shifted to `[null, 10, 7]` and was then aggregated.
+This went horribly wrong, because the `map_batches` applies the function before we aggregate! So that means the whole column `[10, 7, 1]` got shifted to `[null, 10, 7]` and was then aggregated.
 
-So my advice is to never use `map` in the `group_by` context unless you know you need it and know what you are doing.
+So my advice is to never use `map_batches` in the `group_by` context unless you know you need it and know what you are doing.
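
To make the failure mode concrete (an illustration, not part of the diff): in the `group_by` context `map_batches` receives the full, un-aggregated column, so the shift crosses group boundaries before aggregation happens:

```python
import polars as pl

# The Series that map_batches sees in the group_by context is the whole column.
s = pl.Series("values", [10, 7, 1])

# Shifting it moves the 10 from group "a" into the position that later ends up
# in group "b" once the shifted column is aggregated.
print(s.shift())  # [null, 10, 7]
```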
 
-## To `apply`
+## To `map_elements`
 
-Luckily we can fix previous example with `apply`. `apply` works on the smallest logical elements for that operation.
+Luckily we can fix previous example with `map_elements`. `map_elements` works on the smallest logical elements for that operation.
 
 That is:
 
 - `select context` -> single elements
 - `group by context` -> single groups
 
-So with `apply` we should be able to fix our example:
+So with `map_elements` we should be able to fix our example:
 
-{{code_block('user-guide/expressions/user-defined-functions','apply',['apply'])}}
+{{code_block('user-guide/expressions/user-defined-functions','map_elements',['map_elements'])}}
 
 ```python exec="on" result="text" session="user-guide/udf"
---8<-- "python/user-guide/expressions/user-defined-functions.py:apply"
+--8<-- "python/user-guide/expressions/user-defined-functions.py:map_elements"
 ```
 
 And observe, a valid result! 🎉
 
-## `apply` in the `select` context
+## `map_elements` in the `select` context
 
-In the `select` context, the `apply` expression passes elements of the column to the Python function.
+In the `select` context, the `map_elements` expression passes elements of the column to the Python function.
 
 _Note that you are now running Python, this will be slow._
 
 Let's go through some examples to see what to expect. We will continue with the `DataFrame` we defined at the start of
-this section and show an example with the `apply` function and a counter example where we use the expression API to
+this section and show an example with the `map_elements` function and a counter example where we use the expression API to
 achieve the same goals.
 
 ### Adding a counter
 
 In this example we create a global `counter` and then add the integer `1` to the global state at every element processed.
 Every iteration the result of the increment will be added to the element value.
 
-> Note, this example isn't provided in Rust. The reason is that the global `counter` value would lead to data races when this apply is evaluated in parallel. It would be possible to wrap it in a `Mutex` to protect the variable, but that would be obscuring the point of the example. This is a case where the Python Global Interpreter Lock's performance tradeoff provides some safety guarantees.
+> Note, this example isn't provided in Rust. The reason is that the global `counter` value would lead to data races when this `apply` is evaluated in parallel. It would be possible to wrap it in a `Mutex` to protect the variable, but that would be obscuring the point of the example. This is a case where the Python Global Interpreter Lock's performance tradeoff provides some safety guarantees.

Review comments:

> keeping this one as `apply`

 
-{{code_block('user-guide/expressions/user-defined-functions','counter',['apply'])}}
+{{code_block('user-guide/expressions/user-defined-functions','counter',['map_elements'])}}
 
 ```python exec="on" result="text" session="user-guide/udf"
 --8<-- "python/user-guide/expressions/user-defined-functions.py:counter"
 ```
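
The included `counter` snippet itself is not visible in this diff; the following is only a sketch of what it plausibly contains, reconstructed from the prose above and from the `solution_*` aliases in the snippets-file diff (the body of `add_counter` is an assumption):

```python
import polars as pl

df = pl.DataFrame({"keys": ["a", "a", "b"], "values": [10, 7, 1]})

counter = 0


def add_counter(val: int) -> int:
    # Increment the global counter once per processed element
    # and add the running count to the element's value.
    global counter
    counter += 1
    return counter + val


out = df.select(
    pl.col("values").map_elements(add_counter).alias("solution_map_elements"),
    # Expression-API equivalent: add a 1..n running count to the column.
    (pl.col("values") + pl.int_range(1, pl.count() + 1)).alias("solution_expr"),
)
print(out)
```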
 
 ### Combining multiple column values
 
-If we want to have access to values of different columns in a single `apply` function call, we can create `struct` data
+If we want to have access to values of different columns in a single `map_elements` function call, we can create `struct` data
 type. This data type collects those columns as fields in the `struct`. So if we'd create a struct from the columns
 `"keys"` and `"values"`, we would get the following struct elements:
 
@@ -150,7 +136,7 @@ type. This data type collects those columns as fields in the `struct`. So if we'
 
 In Python, those would be passed as `dict` to the calling Python function and can thus be indexed by `field: str`. In Rust, you'll get a `Series` with the `Struct` type. The fields of the struct can then be indexed and downcast.
 
-{{code_block('user-guide/expressions/user-defined-functions','combine',['apply','struct'])}}
+{{code_block('user-guide/expressions/user-defined-functions','combine',['map_elements','struct'])}}
 
 ```python exec="on" result="text" session="user-guide/udf"
 --8<-- "python/user-guide/expressions/user-defined-functions.py:combine"
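
For reference, a self-contained version of the `combine` snippet being included here; the `select` call mirrors the one visible in the snippets-file diff, while the sample data is an assumption:

```python
import polars as pl

df = pl.DataFrame({"keys": ["a", "a", "b"], "values": [10, 7, 1]})

# Each struct element is passed to the Python function as a dict,
# so its fields can be indexed by name.
out = df.select(
    pl.struct(["keys", "values"])
    .map_elements(lambda x: len(x["keys"]) + x["values"])
    .alias("solution_map_elements"),
    (pl.col("keys").str.len_bytes() + pl.col("values")).alias("solution_expr"),
)
print(out)
```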
Review comments:

> Shouldn't this be `map_elements`? A shift here would be incorrect? (Don't read the context).

> I think the purpose of this section is to show how using `map_batches` within `group_by` leads to incorrect (or at least, unexpected) results. So although this should be `map_elements`, the way it's written it first shows how the current approach (`map_batches`) gives unexpected results, and then shows how `map_elements` gives the expected results.

> Right! Reviewing lost snippets is hard. 😅
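
To see the contrast this thread describes in one place, a hedged sketch (not from the PR) that runs both approaches in a single aggregation:

```python
import polars as pl

df = pl.DataFrame({"keys": ["a", "a", "b"], "values": [10, 7, 1]})

out = df.group_by("keys", maintain_order=True).agg(
    # Applied to the full column before aggregation: group "b" inherits a value from "a".
    pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map_batches"),
    # Applied per group: each group is shifted on its own, as expected.
    pl.col("values").map_elements(lambda s: s.shift()).alias("shift_map_elements"),
    # Pure expression API: same expected result, without calling into Python.
    pl.col("values").shift().alias("shift_expression"),
)
print(out)
```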