Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
DOC: User Guide Page on user-defined functions #61195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
DOC: User Guide Page on user-defined functions #61195
Changes from 15 commits
3f94137
bf984ca
fe67ec8
4ec5697
11392d7
f322d9e
b6b7b02
d20bcc7
72f7b62
90a2d24
0d02d64
214f0ac
fffaad0
561a1f5
c6891a0
f56ec28
c00d1d2
8d41537
467bc93
efd5201
af7964b
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "change the data differently" sounds very close to mutating in a UDF, which we explicitly do not support. What do you think of "behave differently".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
“Behave differently” sounds clearer and avoids implying mutation. I'll update it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just personal opinion, but to me it makes more sense to explain what UDFs are in pandas before explaining when not to use them. This order seems reasonable assuming users already know what pandas udfs are in practice, but I'd personally prefer not to assume it in the user guide for UDFs.
In my opinion, after the previous introduction which is great, I'd show a very simple example so we make sure users reading this understand the very basics.
Something like:
Building on top of this, like then showing the same with a
DataFrame
, at some point showing UDFs that receive the whole column with.apply
... should help make sure users are following and understanding all the information provided here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit negative here. This is duplicating a lot of other documentation that we already have. I think we should instead link to that documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind pointing out to an specific example @rhshadrach? I found documentation for the aggregate functions, but not much for the
map
,apply
... onSeries
andDataFrame
other than in the API docs. I agree with not having much duplication. Personally, if there is few here and there like in the FAQs, Performance page... I'd rather have the docs related to these methods in this page, as it feels like the natural place, and link to the sections here in the FAQs, performance hints, groupby user guide... Of course there can be cases where it makes more sense the opposite, but maybe we can discuss the specific cases where there is duplication.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apply: https://pandas.pydata.org/docs/user_guide/basics.html#row-or-column-wise-function-application
map: https://pandas.pydata.org/docs/user_guide/basics.html#applying-elementwise-functions
If we are going to move the docs on e.g.
DataFrame.agg
here, then this no longer is a page just about UDFs asDataFrame.agg
does more than just use UDFs. In addition, that seems like a large reworking of the docs for little (in my opinion, actually negative) benefit.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally missed the Essential basic functionality page, thanks for pointing that out. Fully agree with you that what I proposed here is repeating again the whole https://pandas.pydata.org/docs/user_guide/basics.html#function-application section . And I agree that's not a good idea.
Personally, I'd rather not have that section, and have that content here. At least in my experience, map and apply are common, but not essential as other parts described in that page. And also, I think the structure of the user guide will be clearer and easier to find things with the changes.
For the
DataFrame.agg
, there is already a groupby page, and I think just having the methods in the lists of methods that support udfs would be good, and then just a mention that points out to the group by page where all the detail explanation regarding groupping is presented with examples.There may be other structures, but what I'd like is that we can give users structure to the related methods. I think
Series
has around 200 methods and attributes. Users having to navigate that whole API to find out themselves that map, apply and pipe are kind of the same just changing the input of the udf, doesn't seem ideal. I think this page here can really help in that.What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we move the main document of
apply
here, then I am quite opposed to calling this a page on UDFs as apply does more than just take UDFs. By documentingapply("sum")
et al here, it seems to me we make this page far less clear than leaving it as solely UDFs.In any case, is that something you think should be tackled in this PR? This PR started as
I do not think we should morph it into moving around documentation from other places, especially when there are disagreements.
Which is why I think this page should be a comparison of UDF methods (as it mostly is now), while pointing to more thorough documentation elsewhere in the User Guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, I think I understand your point better now. Maybe I'd like to improve a bit the apply/maps docs in essential, but that's unrelated to this PR. And happy to move forward here focussing on the UDFs and not on the methods, as you describe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if Sphinx is more flexible now, but this had to be the same exact length as the title before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Title marker needs to be at least as long as the text, but can be longer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking we should remove
groups of data
here.DataFrame.apply
that you're referencing doesn't operate on groups, and you mention groupby below.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also make a mention of resample, rolling, expanding, and ewm. Perhaps link to each section in the User Guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add the other objects to this note, it seems to me they all belong together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I'd only keep the first table. I think the description can already explain anything not clear present in the second table.
And I'd make this section
choosing the right method
part of theMethods that support User-Defined Functions
. To me it feels like we're having two sections for mostly the same.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd remove this for all methods, as the title section and every mention of
apply
will already be a link.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things like
.agg(["sum", "mean"])
aren't UDFs, so I don't think they should be mentioned here, and it could be make users think these types of usages are slow (they are not).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't an example of a UDF. I really like your example of using linear regression - can we do that here? It's a bit unfortunate that groupby.transform does not allow operating on the entire group (only works column-by-column) here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure on having
filter
here for now. I think it's very good that you added it, as it doesn't support udfs, but it probably should. So, it opens a discussion we probably want to have about adding them. @rhshadrach thoughts?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect the reason this was added is that
DataFrameGroupBy.filter
does accept UDFs. Perhaps that should be mentioned instead?I actually think
DataFrame.filter
should accept Boolean masks, similar to PySpark and Polars. But agreed that discussion is not for here!There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think map is more performant than apply, in general apply will be more performant. I'd remove this as a reason, as it depends a lot on the use case, and I don't think this is a general rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Just curious: Why is applying generally more performant than mapping in practice? I thought map was faster for simple element-wise UDFs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In map, without anything special, pandas should necessarily loop in Python over the elements, casting each to a PyObject. In apply the same can be true, but since a whole column is given to the user, the user could also use vectorization. Like, a simple function doing a
x + 1
would be fast when using apply and slow when using map. That's why I wouldn't say map is faster.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just personal preference, but these last 3 sections seem to be talking about the same (performance), I'd have just a section about performance.
I'd keep it short for now, and we can iterate over it later. The reason is that each time we review this before merging it we need to re-read the whole document. So, if we can finish the main part above first, and have this as a placeholder, then in a second PR we can focus more on performance without having to keep reviewing the first part again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vectorized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe worth mentioning and comparing also
.pipe
, which is both vectorized and a udf?