-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: update user-defined-functions for 0.19.x #13071
Conversation
print(df) | ||
# --8<-- [end:dataframe] | ||
|
||
# --8<-- [start:shift_map_batches] | ||
out = df.group_by("keys", maintain_order=True).agg( | ||
pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map"), | ||
pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map_batches"), | ||
pl.col("values").shift().alias("shift_expression"), | ||
) | ||
print(df) | ||
# --8<-- [end:dataframe] | ||
print(out) | ||
# --8<-- [end:shift_map_batches] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently, this snippet creates df
, then creates out
, then prints df
. But out
is never used - instead, in the .md
file, the output of out
is hard-coded.
I'm suggesting to, instead, split the snippet into two:
- create
df
, and show it - create
out
, and show it, without hard-coding any output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
557fe67
to
4c045c1
Compare
c21cb76
to
356bd52
Compare
achieve the same goals. | ||
|
||
### Adding a counter | ||
|
||
In this example we create a global `counter` and then add the integer `1` to the global state at every element processed. | ||
Every iteration the result of the increment will be added to the element value. | ||
|
||
> Note, this example isn't provided in Rust. The reason is that the global `counter` value would lead to data races when this apply is evaluated in parallel. It would be possible to wrap it in a `Mutex` to protect the variable, but that would be obscuring the point of the example. This is a case where the Python Global Interpreter Lock's performance tradeoff provides some safety guarantees. | ||
> Note, this example isn't provided in Rust. The reason is that the global `counter` value would lead to data races when this `apply` is evaluated in parallel. It would be possible to wrap it in a `Mutex` to protect the variable, but that would be obscuring the point of the example. This is a case where the Python Global Interpreter Lock's performance tradeoff provides some safety guarantees. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keeping this one as apply
because that's still the name on the Rust side
@@ -45,9 +41,9 @@ df.with_columns([ | |||
]) | |||
``` | |||
|
|||
Use cases for `map` in the `group_by` context are slim. They are only used for performance reasons, but can quite easily lead to incorrect results. Let me explain why. | |||
Use cases for `map_batches` in the `group_by` context are slim. They are only used for performance reasons, but can quite easily lead to incorrect results. Let me explain why. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be honest I don't really understand this phrase to begin with - what are the performance reasons to use map_batch
? Or is that only on the Rust side, referring to map
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do an elementwise operations with map batches. E.g. lambda x * 2
would be correct in both.
{{code_block('user-guide/expressions/user-defined-functions','dataframe',['map'])}} | ||
{{code_block('user-guide/expressions/user-defined-functions','dataframe',[])}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now this snippet just creates a dataframe, so I've removed the map
reference and put it in the next snippet (as map_batches
)
=== ":fontawesome-brands-python: Python" | ||
[:material-api: `map_elements`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.map_elements.html) | ||
|
||
{{code_block('user-guide/expressions/user-defined-functions','apply',['apply'])}} | ||
{{code_block('user-guide/expressions/user-defined-functions','map_elements',[])}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map_elements
(and map_batches
) don't appear in the Rust API docs
To avoid warnings in the docs build, I've just added the Python-only link to map_elements
above
out = df.group_by("keys", maintain_order=True).agg( | ||
pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map"), | ||
pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map_batches"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be map_elements
? A shift here would be incorrect? (Don't read the context).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the purpose of this section is to show how using map_batches
within group_by
leads to incorrect (or at least, unexpected) results
So although this should be map_elements
, the way it's written it:
- shows that the "wrong" one (
map_batches
) gives unexpected results - shows that the "correct" one (
map_elements
) gives the expected results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right! Reviewing lost snippets is hard. 😅
Closes #11104
I'd like to make a separate PR to do a bit of an overhaul of this page, as it's quite complex and am not too keen on starting off with a gotcha...but that's for a separate PR, first I'd suggest just updating the syntax
The Rust examples are similar to the Python ones, but don't actually match 100% (this is already the case, regardless of this PR) - for example, the Python dataframe has
i64
values, whereas the Rust onei32
. No big deal, just something I noticedthis is how the rewritten part in the middle looks like now:
Preview (click here)