@@ -14,14 +14,15 @@ The pandas API most likely cannot efficiently handle the complexity of the aggre
```python exec="true" source="above" result="python" session="df_ex1"
import narwhals as nw
import pandas as pd
+ from narwhals.typing import IntoFrameT

data = {"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1], "c": [10, 20, 30, 40, 50]}

df_pd = pd.DataFrame(data)


@nw.narwhalify
- def approach_1(df):
+ def approach_1(df: IntoFrameT) -> IntoFrameT:

    # Pay attention to this next line
    df = df.group_by("a").agg(d=(nw.col("b") + nw.col("c")).sum())
@@ -43,7 +44,7 @@ The pandas API most likely cannot efficiently handle the complexity of the aggre


@nw.narwhalify
- def approach_2(df):
+ def approach_2(df: IntoFrameT) -> IntoFrameT:

    # Pay attention to this next line
    df = df.with_columns(d=nw.col("b") + nw.col("c")).group_by("a").agg(nw.sum("d"))
@@ -54,40 +55,45 @@ The pandas API most likely cannot efficiently handle the complexity of the aggre
print(approach_2(df_pd))
```

-
Both approaches shown above return exactly the same result, but Approach 1 is inefficient and triggers the warning message
we showed at the top.

What makes the first approach inefficient and the second approach efficient? It comes down to what the
pandas API lets us express.

## Approach 1
+
```python
# From line 11

return df.group_by("a").agg((nw.col("b") + nw.col("c")).sum().alias("d"))
```

To translate this to pandas, we would do:
+
```python
df.groupby("a").apply(
    lambda df: pd.Series([(df["b"] + df["c"]).sum()], index=["d"]), include_groups=False
)
```
+
Any time you use `apply` in pandas, that's a performance footgun - best to avoid it and use vectorised operations instead.
Let's take a look at how "approach 2" gets translated to pandas to see the difference.
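
Before moving on, here is a minimal sketch of the `apply` problem in plain pandas. The names `slow` and `fast` are illustrative only, and the columns are selected explicitly (an assumption made here so the snippet does not depend on `include_groups`, which only newer pandas versions accept). It computes the same per-group sum with and without a Python callback:

```python
import pandas as pd

data = {"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1], "c": [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Slow path: pandas calls the Python lambda once per group.
slow = df.groupby("a")[["b", "c"]].apply(lambda g: (g["b"] + g["c"]).sum())

# Fast path: add the columns once, then use the built-in groupby sum.
fast = (df["b"] + df["c"]).groupby(df["a"]).sum()

# Both paths produce the same per-group totals.
assert slow.tolist() == fast.tolist()
```

On five rows the difference is invisible; the cost of the slow path grows with the number of groups, since each group triggers a separate Python call.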

## Approach 2
+
```python
# Line 11 in Approach 2

return df.with_columns(d=nw.col("b") + nw.col("c")).group_by("a").agg(nw.sum("d"))
```

This gets roughly translated to:
+
```python
df.assign(d=lambda df: df["b"] + df["c"]).groupby("a").agg({"d": "sum"})
```
+
Because we're using pandas' own API, as opposed to `apply` and a custom `lambda` function, this is going to be much more efficient.
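
To get a rough feel for the difference, the two pandas translations can be timed on a larger frame. This is only a sketch: the helper names `with_apply` and `with_builtin`, the frame `df_big`, and the row and group counts are all made up for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_groups = 20_000, 2_000  # arbitrary sizes, chosen only for illustration
df_big = pd.DataFrame(
    {
        "a": rng.integers(0, n_groups, n_rows),
        "b": rng.integers(0, 100, n_rows),
        "c": rng.integers(0, 100, n_rows),
    }
)


def with_apply(df):
    # Approach 1 style: one Python-level call per group.
    return df.groupby("a")[["b", "c"]].apply(lambda g: (g["b"] + g["c"]).sum())


def with_builtin(df):
    # Approach 2 style: column-wise addition, then pandas' built-in sum.
    return df.assign(d=lambda x: x["b"] + x["c"]).groupby("a").agg({"d": "sum"})["d"]


# Same totals either way; only the execution path differs.
assert with_apply(df_big).tolist() == with_builtin(df_big).tolist()

print("apply:   ", timeit.timeit(lambda: with_apply(df_big), number=3))
print("built-in:", timeit.timeit(lambda: with_builtin(df_big), number=3))
```

The equality check matters as much as the timing: it confirms that restructuring the query changes only how the work is done, not the answer.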

## Tips for Avoiding the `UserWarning`