DOC: User Guide Page on user-defined functions #61195
Changes from 16 commits
3f94137
bf984ca
fe67ec8
4ec5697
11392d7
f322d9e
b6b7b02
d20bcc7
72f7b62
90a2d24
0d02d64
214f0ac
fffaad0
561a1f5
c6891a0
f56ec28
c00d1d2
8d41537
467bc93
efd5201
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -78,6 +78,7 @@ Guides | |
boolean | ||
visualization | ||
style | ||
user_defined_functions | ||
groupby | ||
window | ||
timeseries | ||
|
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,301 @@ | ||||||||||||||||||||||
.. _user_defined_functions: | ||||||||||||||||||||||
|
||||||||||||||||||||||
{{ header }} | ||||||||||||||||||||||
|
||||||||||||||||||||||
***************************** | ||||||||||||||||||||||
User-Defined Functions (UDFs) | ||||||||||||||||||||||
***************************** | ||||||||||||||||||||||
|
||||||||||||||||||||||
In pandas, User-Defined Functions (UDFs) provide a way to extend the library’s | ||||||||||||||||||||||
functionality by allowing users to apply custom computations to their data. While | ||||||||||||||||||||||
pandas comes with a set of built-in functions for data manipulation, UDFs offer | ||||||||||||||||||||||
flexibility when built-in methods are not sufficient. These functions can be | ||||||||||||||||||||||
applied at different levels: element-wise, row-wise, column-wise, or group-wise, | ||||||||||||||||||||||
and behave differently, depending on the method used. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Here’s a simple example to illustrate a UDF applied to a Series: | ||||||||||||||||||||||
|
||||||||||||||||||||||
.. ipython:: python

    s = pd.Series([1, 2, 3])

    # Simple UDF that adds 1 to a value
    def add_one(x):
        return x + 1

    # Apply the function element-wise using .map
    s.map(add_one)
|
||||||||||||||||||||||
You can also apply UDFs to an entire DataFrame. For example: | ||||||||||||||||||||||
|
||||||||||||||||||||||
.. ipython:: python

    df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

    # UDF that takes a row and returns the sum of columns A and B
    def sum_row(row):
        return row["A"] + row["B"]

    # Apply the function row-wise (axis=1 means apply across columns per row)
    df.apply(sum_row, axis=1)
|
||||||||||||||||||||||
|
||||||||||||||||||||||
Why Not To Use User-Defined Functions | ||||||||||||||||||||||
------------------------------------- | ||||||||||||||||||||||
|
||||||||||||||||||||||
While UDFs provide flexibility, they come with significant drawbacks, primarily | ||||||||||||||||||||||
related to performance and behavior. When using UDFs, pandas must perform inference | ||||||||||||||||||||||
on the result, and that inference could be incorrect. Furthermore, unlike vectorized operations, | ||||||||||||||||||||||
UDFs are slower because pandas can't optimize their computations, leading to | ||||||||||||||||||||||
inefficient processing. | ||||||||||||||||||||||
|
||||||||||||||||||||||
.. note::
    In general, most tasks can and should be accomplished using pandas’ built-in methods or vectorized operations.
|
||||||||||||||||||||||
Despite their drawbacks, UDFs can be helpful when: | ||||||||||||||||||||||
|
||||||||||||||||||||||
* **Custom Computations Are Needed**: Implementing complex logic or domain-specific calculations that pandas' | ||||||||||||||||||||||
built-in methods cannot handle. | ||||||||||||||||||||||
* **Extending pandas' Functionality**: Applying external libraries or specialized algorithms unavailable in pandas. | ||||||||||||||||||||||
* **Handling Complex Grouped Operations**: Performing operations on grouped data that standard methods do not support. | ||||||||||||||||||||||
|
||||||||||||||||||||||
|
||||||||||||||||||||||
For example: | ||||||||||||||||||||||
|
||||||||||||||||||||||
.. code-block:: python

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Sample data
    df = pd.DataFrame({
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'x': [1, 2, 3, 1, 2, 3],
        'y': [2, 4, 6, 1, 2, 1.5]
    })

    # Function to fit a model to each group
    def fit_model(group):
        model = LinearRegression()
        model.fit(group[['x']], group['y'])
        # Return a new object rather than mutating the group in place
        return group.assign(y_pred=model.predict(group[['x']]))

    result = df.groupby('group').apply(fit_model)
|
||||||||||||||||||||||
|
||||||||||||||||||||||
Methods that support User-Defined Functions | ||||||||||||||||||||||
------------------------------------------- | ||||||||||||||||||||||
|
||||||||||||||||||||||
User-Defined Functions can be applied across various pandas methods: | ||||||||||||||||||||||
|
||||||||||||||||||||||
* :meth:`~DataFrame.apply` - A flexible method that allows applying a function to Series and | ||||||||||||||||||||||
DataFrames. | ||||||||||||||||||||||
* :meth:`~DataFrame.agg` (Aggregate) - Used for summarizing data, supporting custom | ||||||||||||||||||||||
aggregation functions. | ||||||||||||||||||||||
* :meth:`~DataFrame.transform` - Applies a function to Series and DataFrames while preserving the shape of
  the original data.
* :meth:`~DataFrame.filter` - Selects a subset of the rows or columns of a Series or DataFrame by label;
  it does not accept a UDF directly.
* :meth:`~DataFrame.map` - Applies an element-wise function to a Series or DataFrame, useful for
  transforming individual values.
* :meth:`~DataFrame.pipe` - Allows chaining custom functions to process Series or
  DataFrames in a clean, readable manner.
What do you think about having this as a table? Personally I think it should make it easier to understand the differences about the methods. As a general idea:
Not sure if it makes sense to combine with the table below.

Yeah I agree with you, thanks for the suggestion! I will keep the two tables separate for now
||||||||||||||||||||||
|
||||||||||||||||||||||
All of these pandas methods can be used with both Series and DataFrame objects, providing versatile | ||||||||||||||||||||||
ways to apply UDFs across different pandas data structures. | ||||||||||||||||||||||
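
To make the differences between these methods concrete, here is a minimal sketch on a small DataFrame. The helper names (`value_range`, `center`, `add_total`) and the sample data are made up for illustration and are not part of the proposed page.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

def value_range(col):
    # Reduces a whole column (a Series) to a single scalar
    return col.max() - col.min()

def center(col):
    # Returns a Series with the same length as its input
    return col - col.mean()

def add_total(frame):
    # Receives the whole DataFrame and returns a new one
    return frame.assign(total=frame["A"] + frame["B"])

ranges = df.agg(value_range)             # one value per column
centered = df.transform(center)          # same shape as df
doubled = df["A"].map(lambda v: v * 2)   # element-wise on a Series
with_total = df.pipe(add_total)          # function applied to the whole object
```

The main difference is what the UDF receives: a single element (`map`), a column or row (`agg`, `transform`, `apply`), or the entire object (`pipe`).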
|
||||||||||||||||||||||
.. note::
    Some of these methods can also be applied to groupby, resample, and various window objects.
    See :ref:`groupby`, :ref:`resample()<timeseries>`, :ref:`rolling()<window>`, :ref:`expanding()<window>`,
    and :ref:`ewm()<window>` for details.
|
||||||||||||||||||||||
|
||||||||||||||||||||||
Choosing the Right Method | ||||||||||||||||||||||
------------------------- | ||||||||||||||||||||||
When applying UDFs in pandas, it is essential to select the appropriate method based | ||||||||||||||||||||||
on your specific task. Each method has its strengths and is designed for different use | ||||||||||||||||||||||
cases. Understanding the purpose and behavior of each method will help you make informed | ||||||||||||||||||||||
decisions, ensuring more efficient and maintainable code. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Below is a table overview of all methods that accept UDFs: | ||||||||||||||||||||||
|
||||||||||||||||||||||
+------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+ | ||||||||||||||||||||||
| Method | Purpose | Supports UDFs | Keeps Shape | Recommended Use Case | | ||||||||||||||||||||||
+==================+======================================+===========================+====================+==========================================+ | ||||||||||||||||||||||
| :meth:`apply` | General-purpose function | Yes | Yes (when axis=1) | Custom row-wise or column-wise operations| | ||||||||||||||||||||||
+------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+ | ||||||||||||||||||||||
| :meth:`agg` | Aggregation | Yes | No | Custom aggregation logic | | ||||||||||||||||||||||
+------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+ | ||||||||||||||||||||||
| :meth:`transform`| Transform without reducing dimensions| Yes | Yes | Broadcast element-wise transformations | | ||||||||||||||||||||||
+------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+ | ||||||||||||||||||||||
| :meth:`map` | Element-wise mapping | Yes | Yes | Simple element-wise transformations | | ||||||||||||||||||||||
+------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+ | ||||||||||||||||||||||
| :meth:`pipe` | Functional chaining | Yes | Yes | Building clean operation pipelines | | ||||||||||||||||||||||
+------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+ | ||||||||||||||||||||||
| :meth:`filter` | Row/Column selection | Not directly | Yes | Subsetting based on conditions | | ||||||||||||||||||||||
+------------------+--------------------------------------+---------------------------+--------------------+------------------------------------------+ | ||||||||||||||||||||||
|
||||||||||||||||||||||
:meth:`DataFrame.apply` | ||||||||||||||||||||||
~~~~~~~~~~~~~~~~~~~~~~~ | ||||||||||||||||||||||
|
||||||||||||||||||||||
The :meth:`DataFrame.apply` method allows you to apply UDFs along either rows or columns. While flexible,
it is slower than vectorized operations and should be used only when you need operations
that cannot be achieved with built-in pandas functions.
|
||||||||||||||||||||||
When to use: :meth:`DataFrame.apply` is suitable when no alternative vectorized method or UDF method is available, | ||||||||||||||||||||||
but consider optimizing performance with vectorized operations wherever possible. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Documentation can be found at :meth:`~DataFrame.apply`. | ||||||||||||||||||||||
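
As a rough sketch of the two directions, the same UDF can be passed with the default `axis=0` (the function receives each column) or with `axis=1` (the function receives each row). The `spread` helper and the data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

def spread(values):
    # `values` is a Series: one column when axis=0, one row when axis=1
    return values.max() - values.min()

per_column = df.apply(spread)        # indexed by column label
per_row = df.apply(spread, axis=1)   # indexed by row label
```

Because `spread` returns a scalar, both results are Series; a UDF that returns a Series would instead produce a DataFrame.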
|
||||||||||||||||||||||
:meth:`DataFrame.agg` | ||||||||||||||||||||||
~~~~~~~~~~~~~~~~~~~~~ | ||||||||||||||||||||||
|
||||||||||||||||||||||
If you need to aggregate data, :meth:`DataFrame.agg` is a better choice than apply because it is | ||||||||||||||||||||||
specifically designed for aggregation operations. | ||||||||||||||||||||||
|
||||||||||||||||||||||
When to use: Use :meth:`DataFrame.agg` for performing custom aggregations, where the operation returns | ||||||||||||||||||||||
a scalar value on each input. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Documentation can be found at :meth:`~DataFrame.agg`. | ||||||||||||||||||||||
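
A minimal sketch of a custom aggregation, assuming a hypothetical `peak_to_peak` UDF that reduces each group's values to one scalar:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

def peak_to_peak(values):
    # Receives the values of one group as a Series, returns a scalar
    return values.max() - values.min()

result = df.groupby("group")["value"].agg(peak_to_peak)
```

The result has one entry per group, which is what distinguishes an aggregation from a shape-preserving transformation.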
|
||||||||||||||||||||||
:meth:`DataFrame.transform` | ||||||||||||||||||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||||||||||||||||||||
|
||||||||||||||||||||||
The :meth:`DataFrame.transform` method is intended for transformations that preserve the shape of the original DataFrame.
It can be faster than apply in cases where pandas is able to take advantage of its internal optimizations.
|
||||||||||||||||||||||
When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Documentation can be found at :meth:`~DataFrame.transform`. | ||||||||||||||||||||||
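
As a smaller illustration of shape preservation, here is a minimal sketch that centers values within each group. The `demean` helper and the sample data are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "y": [1.0, 3.0, 10.0, 30.0],
})

def demean(values):
    # Returns a Series of the same length as the group it receives
    return values - values.mean()

# The result is aligned with the original index, so it can be assigned back
df["y_centered"] = df.groupby("group")["y"].transform(demean)
```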
|
||||||||||||||||||||||
.. code-block:: python

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'x': [1, 2, 3, 1, 2, 3],
        'y': [2, 4, 6, 1, 2, 1.5]
    }).set_index("x")

    # Function to fit a model to each group
    def fit_model(group):
        x = group.index.to_frame()
        y = group
        model = LinearRegression()
        model.fit(x, y)
        pred = model.predict(x)
        return pred

    result = df.groupby('group').transform(fit_model)
|
||||||||||||||||||||||
:meth:`DataFrame.filter` | ||||||||||||||||||||||
~~~~~~~~~~~~~~~~~~~~~~~~ | ||||||||||||||||||||||
|
||||||||||||||||||||||
The :meth:`DataFrame.filter` method is used to select a subset of the DataFrame’s
columns or rows by label. It is useful when you want to extract specific columns or rows
whose labels match particular conditions.
|
||||||||||||||||||||||
When to use: Use :meth:`DataFrame.filter` when you want to subset a DataFrame or Series by label, with the labels optionally chosen by a UDF.
|
||||||||||||||||||||||
.. note::
    :meth:`DataFrame.filter` does not accept UDFs, but its ``items`` argument can be
    built with a list comprehension that applies a UDF to the labels.
Comment on lines +202 to +204

I'm unsure on having

I suspect the reason this was added is that I actually think
||||||||||||||||||||||
|
||||||||||||||||||||||
.. ipython:: python

    # Sample DataFrame
    df = pd.DataFrame({
        'AA': [1, 2, 3],
        'BB': [4, 5, 6],
        'C': [7, 8, 9],
        'D': [10, 11, 12]
    })

    # Function that returns True for column names longer than 1 character
    def is_long_name(column_name):
        return len(column_name) > 1

    # Keep only the columns whose names pass the check
    df_filtered = df.filter(items=[col for col in df.columns if is_long_name(col)])
    print(df_filtered)
|
||||||||||||||||||||||
Since filter does not directly accept a UDF, you have to apply the UDF indirectly, | ||||||||||||||||||||||
for example, by using list comprehensions. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Documentation can be found at :meth:`~DataFrame.filter`. | ||||||||||||||||||||||
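
For comparison, the same selection can be sketched without `filter`, by building a boolean mask from the UDF and passing it to `.loc`. This is one possible alternative, not something the proposed page prescribes.

```python
import pandas as pd

df = pd.DataFrame({
    "AA": [1, 2, 3],
    "BB": [4, 5, 6],
    "C": [7, 8, 9],
    "D": [10, 11, 12],
})

def is_long_name(column_name):
    return len(column_name) > 1

# One boolean per column, in column order
mask = [is_long_name(col) for col in df.columns]

# Boolean selection along the column axis
df_long = df.loc[:, mask]
```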
|
||||||||||||||||||||||
:meth:`DataFrame.map` | ||||||||||||||||||||||
~~~~~~~~~~~~~~~~~~~~~ | ||||||||||||||||||||||
|
||||||||||||||||||||||
:meth:`DataFrame.map` is used specifically to apply element-wise UDFs and is better | ||||||||||||||||||||||
for this purpose compared to :meth:`DataFrame.apply` because of its better performance. | ||||||||||||||||||||||
|
||||||||||||||||||||||
When to use: Use map for applying element-wise UDFs to DataFrames or Series. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Documentation can be found at :meth:`~DataFrame.map`. | ||||||||||||||||||||||
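
A minimal sketch of an element-wise UDF, shown here with `Series.map`. The `label_size` function and its thresholds are invented for illustration.

```python
import pandas as pd

s = pd.Series([0.5, 12.0, 250.0])

def label_size(value):
    # Called once per element, receives a scalar and returns a scalar
    if value < 1:
        return "small"
    if value < 100:
        return "medium"
    return "large"

labels = s.map(label_size)
```

Branching logic like this is a typical case where an element-wise UDF is easier to write than a vectorized expression, at the cost of one Python call per value.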
|
||||||||||||||||||||||
:meth:`DataFrame.pipe` | ||||||||||||||||||||||
~~~~~~~~~~~~~~~~~~~~~~ | ||||||||||||||||||||||
|
||||||||||||||||||||||
The pipe method is useful for chaining operations together into a clean and readable pipeline. | ||||||||||||||||||||||
It is a helpful tool for organizing complex data processing workflows. | ||||||||||||||||||||||
|
||||||||||||||||||||||
When to use: Use pipe when you need to create a pipeline of operations and want to keep the code readable and maintainable. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Documentation can be found at :meth:`~DataFrame.pipe`. | ||||||||||||||||||||||
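
A short sketch of a pipeline built with `pipe`, assuming two hypothetical helper functions. Extra positional arguments given to `pipe` are forwarded to the function after the DataFrame itself.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

def keep_above(frame, column, threshold):
    # Keep the rows where `column` exceeds `threshold`
    return frame[frame[column] > threshold]

def add_ratio(frame):
    # Return a new DataFrame with an extra column
    return frame.assign(ratio=frame["B"] / frame["A"])

result = df.pipe(keep_above, "A", 1).pipe(add_ratio)
```

Without `pipe`, the same steps would read inside-out as `add_ratio(keep_above(df, "A", 1))`.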
|
||||||||||||||||||||||
|
||||||||||||||||||||||
Best Practices
--------------

Maybe just personal preference, but these last 3 sections seem to be talking about the same (performance), I'd have just a section about performance. I'd keep it short for now, and we can iterate over it later. The reason is that each time we review this before merging it we need to re-read the whole document. So, if we can finish the main part above first, and have this as a placeholder, then in a second PR we can focus more on performance without having to keep reviewing the first part again.
|
||||||||||||||||||||||
While UDFs provide flexibility, their use is currently discouraged as they can introduce | ||||||||||||||||||||||
performance issues, especially when written in pure Python. To improve efficiency, | ||||||||||||||||||||||
consider using built-in ``NumPy`` or ``pandas`` functions instead of UDFs | ||||||||||||||||||||||
for common operations. | ||||||||||||||||||||||
|
||||||||||||||||||||||
.. note::
    If performance is critical, explore **vectorized operations** before resorting
    to UDFs.
|
||||||||||||||||||||||
Vectorized Operations | ||||||||||||||||||||||
~~~~~~~~~~~~~~~~~~~~~ | ||||||||||||||||||||||
|
||||||||||||||||||||||
Below is a comparison of using UDFs versus using Vectorized Operations: | ||||||||||||||||||||||
|
||||||||||||||||||||||
.. code-block:: python

    # User-defined function
    def calc_ratio(row):
        return 100 * (row["one"] / row["two"])

    df["new_col"] = df.apply(calc_ratio, axis=1)

    # Vectorized Operation
    df["new_col2"] = 100 * (df["one"] / df["two"])
Maybe worth mentioning and comparing also
||||||||||||||||||||||
|
||||||||||||||||||||||
Measuring how long each operation takes: | ||||||||||||||||||||||
|
||||||||||||||||||||||
.. code-block:: text

    User-defined function: 5.6435 secs
    Vectorized: 0.0043 secs
|
||||||||||||||||||||||
Vectorized operations in pandas are significantly faster than using :meth:`DataFrame.apply` | ||||||||||||||||||||||
with UDFs because they leverage highly optimized C functions | ||||||||||||||||||||||
via NumPy to process entire arrays at once. This approach avoids the overhead of looping | ||||||||||||||||||||||
through rows in Python and making separate function calls for each row, which is slow and | ||||||||||||||||||||||
inefficient. Additionally, NumPy arrays benefit from memory efficiency and CPU-level | ||||||||||||||||||||||
optimizations, making vectorized operations the preferred choice whenever possible. | ||||||||||||||||||||||
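
A comparison of this kind could be measured along the following lines. This is only a sketch: the DataFrame size, the random data, and the use of `timeit` are assumptions, and the timings will differ from the figures quoted above depending on hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: two columns of strictly positive floats
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "one": rng.random(10_000) + 1.0,
    "two": rng.random(10_000) + 1.0,
})

def calc_ratio(row):
    return 100 * (row["one"] / row["two"])

# Time a single run of each approach
udf_time = timeit.timeit(lambda: df.apply(calc_ratio, axis=1), number=1)
vec_time = timeit.timeit(lambda: 100 * (df["one"] / df["two"]), number=1)

print(f"User-defined function: {udf_time:.4f} secs")
print(f"Vectorized: {vec_time:.4f} secs")

# Both approaches should produce the same values
same = np.allclose(df.apply(calc_ratio, axis=1), 100 * (df["one"] / df["two"]))
```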
|
||||||||||||||||||||||
|
||||||||||||||||||||||
Improving Performance with UDFs | ||||||||||||||||||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||||||||||||||||||||||
|
||||||||||||||||||||||
In scenarios where UDFs are necessary, there are still ways to mitigate their performance drawbacks. | ||||||||||||||||||||||
One approach is to use **Numba**, a Just-In-Time (JIT) compiler that can significantly speed up numerical | ||||||||||||||||||||||
Python code by compiling Python functions to optimized machine code at runtime. | ||||||||||||||||||||||
|
||||||||||||||||||||||
By annotating your UDFs with ``@numba.jit``, you can achieve performance closer to vectorized operations, | ||||||||||||||||||||||
especially for computationally heavy tasks. | ||||||||||||||||||||||
|
||||||||||||||||||||||
.. note:: | ||||||||||||||||||||||
You may also refer to the user guide on `Enhancing performance <https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html#numba-jit-compilation>`_ | ||||||||||||||||||||||
for a more detailed guide to using **Numba**. |
Maybe just personal opinion, but to me it makes more sense to explain what UDFs are in pandas before explaining when not to use them. This order seems reasonable assuming users already know what pandas udfs are in practice, but I'd personally prefer not to assume it in the user guide for UDFs.
In my opinion, after the previous introduction which is great, I'd show a very simple example so we make sure users reading this understand the very basics.
Something like:
Building on top of this, like then showing the same with a `DataFrame`, at some point showing UDFs that receive the whole column with `.apply`... should help make sure users are following and understanding all the information provided here.
I am a bit negative here. This is duplicating a lot of other documentation that we already have. I think we should instead link to that documentation.
Do you mind pointing out a specific example @rhshadrach? I found documentation for the aggregate functions, but not much for the `map`, `apply`... on `Series` and `DataFrame` other than in the API docs. I agree with not having much duplication. Personally, if there is a little here and there like in the FAQs, Performance page... I'd rather have the docs related to these methods in this page, as it feels like the natural place, and link to the sections here in the FAQs, performance hints, groupby user guide... Of course there can be cases where the opposite makes more sense, but maybe we can discuss the specific cases where there is duplication.
apply: https://pandas.pydata.org/docs/user_guide/basics.html#row-or-column-wise-function-application
map: https://pandas.pydata.org/docs/user_guide/basics.html#applying-elementwise-functions
If we are going to move the docs on e.g. `DataFrame.agg` here, then this no longer is a page just about UDFs as `DataFrame.agg` does more than just use UDFs. In addition, that seems like a large reworking of the docs for little (in my opinion, actually negative) benefit.
I totally missed the Essential basic functionality page, thanks for pointing that out. Fully agree with you that what I proposed here is repeating again the whole https://pandas.pydata.org/docs/user_guide/basics.html#function-application section . And I agree that's not a good idea.
Personally, I'd rather not have that section, and have that content here. At least in my experience, map and apply are common, but not essential as other parts described in that page. And also, I think the structure of the user guide will be clearer and easier to find things with the changes.
For the `DataFrame.agg`, there is already a groupby page, and I think just having the methods in the lists of methods that support udfs would be good, and then just a mention that points out to the group by page where all the detailed explanation regarding grouping is presented with examples.

There may be other structures, but what I'd like is that we can give users structure to the related methods. I think `Series` has around 200 methods and attributes. Users having to navigate that whole API to find out themselves that map, apply and pipe are kind of the same just changing the input of the udf, doesn't seem ideal. I think this page here can really help in that.

What do you think?
If we move the main document of `apply` here, then I am quite opposed to calling this a page on UDFs as apply does more than just take UDFs. By documenting `apply("sum")` et al here, it seems to me we make this page far less clear than leaving it as solely UDFs.

In any case, is that something you think should be tackled in this PR? This PR started as
I do not think we should morph it into moving around documentation from other places, especially when there are disagreements.
Which is why I think this page should be a comparison of UDF methods (as it mostly is now), while pointing to more thorough documentation elsewhere in the User Guide.
Fair enough, I think I understand your point better now. Maybe I'd like to improve a bit the apply/maps docs in essential, but that's unrelated to this PR. And happy to move forward here focussing on the UDFs and not on the methods, as you describe.