-Use cases for `map` in the `group_by` context are slim. They are only used for performance reasons, but can quite easily lead to incorrect results. Let me explain why.
+Use cases for `map_batches` in the `group_by` context are slim. They are only used for performance reasons, but can quite easily lead to incorrect results. Let me explain why.
Ouch.. we clearly get the wrong results here. Group `"b"` even got a value from group `"a"` 😵.

-This went horribly wrong, because the `map` applies the function before we aggregate! So that means the whole column `[10, 7, 1]` got shifted to `[null, 10, 7]` and was then aggregated.
+This went horribly wrong, because the `map_batches` applies the function before we aggregate! So that means the whole column `[10, 7, 1]` got shifted to `[null, 10, 7]` and was then aggregated.

-So my advice is to never use `map` in the `group_by` context unless you know you need it and know what you are doing.
+So my advice is to never use `map_batches` in the `group_by` context unless you know you need it and know what you are doing.
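
The failure described above can be reproduced without Polars. The following is a minimal plain-Python sketch, not the library's implementation: only the values `[10, 7, 1]` come from the text, while the `keys` column `["a", "a", "b"]` and the helpers `shift` and `group_rows` are assumptions made for illustration.

```python
from collections import defaultdict

keys = ["a", "a", "b"]   # assumed grouping column (hypothetical)
values = [10, 7, 1]      # the column quoted in the text


def shift(xs):
    """Shift a list down by one slot, filling the first slot with None."""
    return [None] + xs[:-1] if xs else []


def group_rows(ks, xs):
    """Collect values into lists keyed by their group label."""
    groups = defaultdict(list)
    for k, x in zip(ks, xs):
        groups[k].append(x)
    return dict(groups)


# The function runs over the WHOLE column first, and only then are rows grouped.
shifted_column = shift(values)                    # [None, 10, 7]
whole_column_first = group_rows(keys, shifted_column)

print(whole_column_first)  # {'a': [None, 10], 'b': [7]}
```

Group `"b"` ends up holding `7`, a value that originally belonged to group `"a"`, which is the kind of leak the text warns about.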

-## To `apply`
+## To `map_elements`

-Luckily we can fix previous example with `apply`. `apply` works on the smallest logical elements for that operation.
+Luckily we can fix previous example with `map_elements`. `map_elements` works on the smallest logical elements for that operation.

That is:

- `select context` -> single elements
- `group by context` -> single groups

-So with `apply` we should be able to fix our example:
+So with `map_elements` we should be able to fix our example:
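
As a rough illustration of "single groups" as the unit of work, here is a plain-Python sketch in which the rows are grouped first and the function then sees one group at a time. This is a stand-in for the idea, not Polars code: the `keys` column and the helper names `shift` and `group_rows` are hypothetical.

```python
from collections import defaultdict

keys = ["a", "a", "b"]   # assumed grouping column (hypothetical)
values = [10, 7, 1]      # the column quoted in the text


def shift(xs):
    """Shift a list down by one slot, filling the first slot with None."""
    return [None] + xs[:-1] if xs else []


def group_rows(ks, xs):
    """Collect values into lists keyed by their group label."""
    groups = defaultdict(list)
    for k, x in zip(ks, xs):
        groups[k].append(x)
    return dict(groups)


# Group first, then hand each group to the function separately.
per_group = {k: shift(xs) for k, xs in group_rows(keys, values).items()}

print(per_group)  # {'a': [None, 10], 'b': [None]}
```

Because each group is shifted on its own, no value crosses a group boundary.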
-In the `select` context, the `apply` expression passes elements of the column to the Python function.
+In the `select` context, the `map_elements` expression passes elements of the column to the Python function.

_Note that you are now running Python, this will be slow._

Let's go through some examples to see what to expect. We will continue with the `DataFrame` we defined at the start of
-this section and show an example with the `apply` function and a counter example where we use the expression API to
+this section and show an example with the `map_elements` function and a counter example where we use the expression API to
achieve the same goals.

### Adding a counter

In this example we create a global `counter` and then add the integer `1` to the global state at every element processed.
Every iteration the result of the increment will be added to the element value.

-> Note, this example isn't provided in Rust. The reason is that the global `counter` value would lead to data races when this apply is evaluated in parallel. It would be possible to wrap it in a `Mutex` to protect the variable, but that would be obscuring the point of the example. This is a case where the Python Global Interpreter Lock's performance tradeoff provides some safety guarantees.
+> Note, this example isn't provided in Rust. The reason is that the global `counter` value would lead to data races when this `apply` is evaluated in parallel. It would be possible to wrap it in a `Mutex` to protect the variable, but that would be obscuring the point of the example. This is a case where the Python Global Interpreter Lock's performance tradeoff provides some safety guarantees.
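
The counter idea can be sketched in plain Python as follows. This is an assumption-laden stand-in rather than the guide's own snippet: the function name `add_counter` and the sample `values` list are hypothetical, and a list comprehension plays the role of the per-element call.

```python
counter = 0              # global state mutated on every element
values = [10, 7, 1]      # sample column (assumed for illustration)


def add_counter(val):
    """Increment the global counter and add the new count to the element."""
    global counter
    counter += 1
    return counter + val


# One Python call per element, mirroring element-wise evaluation.
with_global_state = [add_counter(v) for v in values]

# The same result without global state: add a 1-based running index.
without_global_state = [v + i for i, v in enumerate(values, start=1)]

print(with_global_state)     # [11, 9, 4]
print(without_global_state)  # [11, 9, 4]
```

The second form avoids shared mutable state, which is the property that makes it safe to evaluate in parallel.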
-If we want to have access to values of different columns in a single `apply` function call, we can create `struct` data
+If we want to have access to values of different columns in a single `map_elements` function call, we can create `struct` data
type. This data type collects those columns as fields in the `struct`. So if we'd create a struct from the columns
`"keys"` and `"values"`, we would get the following struct elements:

@@ -150,7 +136,7 @@ type. This data type collects those columns as fields in the `struct`. So if we'

In Python, those would be passed as `dict` to the calling Python function and can thus be indexed by `field: str`. In Rust, you'll get a `Series` with the `Struct` type. The fields of the struct can then be indexed and downcast.
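
To show what "passed as `dict`" looks like from the function's side, here is a small plain-Python sketch. A list of dicts stands in for the struct column; the sample data, the `combine` function, and its arithmetic are assumptions chosen only to demonstrate indexing by field name.

```python
keys = ["a", "a", "b"]   # assumed "keys" column (hypothetical data)
values = [10, 7, 1]      # assumed "values" column (hypothetical data)

# Each struct element is modelled as a dict with one entry per field.
struct_elements = [{"keys": k, "values": v} for k, v in zip(keys, values)]


def combine(row):
    """Use two fields of one struct element in a single function call."""
    return len(row["keys"]) + row["values"]


combined = [combine(row) for row in struct_elements]

print(struct_elements[0])  # {'keys': 'a', 'values': 10}
print(combined)            # [11, 8, 2]
```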