Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: update user-defined-functions for 0.19.x #13071

Merged
merged 2 commits into from
Dec 17, 2023

Conversation

MarcoGorelli
Copy link
Collaborator

@MarcoGorelli MarcoGorelli commented Dec 16, 2023

Closes #11104

I'd like to make a separate PR to do a bit of an overhaul of this page, as it's quite complex and am not too keen on starting off with a gotcha...but that's for a separate PR, first I'd suggest just updating the syntax


The Rust examples are similar to the Python ones, but don't actually match 100% (this is already the case, regardless of this PR) - for example, the Python dataframe has i64 values, whereas the Rust one i32. No big deal, just something I noticed


this is how the rewritten part in the middle looks like now:

Preview (click here)

Screenshot 2023-12-16 202304
Screenshot 2023-12-16 202316

@github-actions github-actions bot added documentation Improvements or additions to documentation python Related to Python Polars rust Related to Rust Polars labels Dec 16, 2023
Comment on lines +14 to +23
print(df)
# --8<-- [end:dataframe]

# --8<-- [start:shift_map_batches]
out = df.group_by("keys", maintain_order=True).agg(
pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map"),
pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map_batches"),
pl.col("values").shift().alias("shift_expression"),
)
print(df)
# --8<-- [end:dataframe]
print(out)
# --8<-- [end:shift_map_batches]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently, this snippet creates df, then creates out, then prints df. But out is never used - instead, in the .md file, the output of out is hard-coded.

I'm suggesting to, instead, split the snippet into two:

  • create df, and show it
  • create out, and show it, without hard-coding any output

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@MarcoGorelli MarcoGorelli force-pushed the update-udf-docs branch 3 times, most recently from c21cb76 to 356bd52 Compare December 16, 2023 20:18
achieve the same goals.

### Adding a counter

In this example we create a global `counter` and then add the integer `1` to the global state at every element processed.
Every iteration the result of the increment will be added to the element value.

> Note, this example isn't provided in Rust. The reason is that the global `counter` value would lead to data races when this apply is evaluated in parallel. It would be possible to wrap it in a `Mutex` to protect the variable, but that would be obscuring the point of the example. This is a case where the Python Global Interpreter Lock's performance tradeoff provides some safety guarantees.
> Note, this example isn't provided in Rust. The reason is that the global `counter` value would lead to data races when this `apply` is evaluated in parallel. It would be possible to wrap it in a `Mutex` to protect the variable, but that would be obscuring the point of the example. This is a case where the Python Global Interpreter Lock's performance tradeoff provides some safety guarantees.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keeping this one as apply because that's still the name on the Rust side

@@ -45,9 +41,9 @@ df.with_columns([
])
```

Use cases for `map` in the `group_by` context are slim. They are only used for performance reasons, but can quite easily lead to incorrect results. Let me explain why.
Use cases for `map_batches` in the `group_by` context are slim. They are only used for performance reasons, but can quite easily lead to incorrect results. Let me explain why.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be honest I don't really understand this phrase to begin with - what are the performance reasons to use map_batch? Or is that only on the Rust side, referring to map?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do an elementwise operations with map batches. E.g. lambda x * 2 would be correct in both.

Comment on lines -50 to +46
{{code_block('user-guide/expressions/user-defined-functions','dataframe',['map'])}}
{{code_block('user-guide/expressions/user-defined-functions','dataframe',[])}}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now this snippet just creates a dataframe, so I've removed the map reference and put it in the next snippet (as map_batches)

@MarcoGorelli MarcoGorelli marked this pull request as ready for review December 17, 2023 08:42
@MarcoGorelli MarcoGorelli marked this pull request as draft December 17, 2023 08:46
Comment on lines +92 to +95
=== ":fontawesome-brands-python: Python"
[:material-api: `map_elements`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.map_elements.html)

{{code_block('user-guide/expressions/user-defined-functions','apply',['apply'])}}
{{code_block('user-guide/expressions/user-defined-functions','map_elements',[])}}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map_elements (and map_batches) don't appear in the Rust API docs

To avoid warnings in the docs build, I've just added the Python-only link to map_elements above

@MarcoGorelli MarcoGorelli marked this pull request as ready for review December 17, 2023 09:08
out = df.group_by("keys", maintain_order=True).agg(
pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map"),
pl.col("values").map_batches(lambda s: s.shift()).alias("shift_map_batches"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be map_elements? A shift here would be incorrect? (Don't read the context).

Copy link
Collaborator Author

@MarcoGorelli MarcoGorelli Dec 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the purpose of this section is to show how using map_batches within group_by leads to incorrect (or at least, unexpected) results

So although this should be map_elements, the way it's written it:

  1. shows that the "wrong" one (map_batches) gives unexpected results
  2. shows that the "correct" one (map_elements) gives the expected results

Copy link
Member

@ritchie46 ritchie46 Dec 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right! Reviewing lost snippets is hard. 😅

@ritchie46 ritchie46 merged commit d21713a into pola-rs:main Dec 17, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update UDF section for latest version
2 participants