
Commit 3175f4b

imback82 authored and HyukjinKwon committed
[SPARK-29664][PYTHON][SQL] Column.getItem behavior is not consistent with Scala
### What changes were proposed in this pull request?

This PR changes the behavior of `Column.getItem` to call `Column.getItem` on the Scala side instead of `Column.apply`.

### Why are the changes needed?

The current behavior is not consistent with that of Scala.

In PySpark:
```Python
df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col.getItem(col('id'))).show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+
```

In Scala:
```Scala
val df = spark.range(2)
val map_col = map(lit(0), lit(100), lit(1), lit(200))
// The following getItem results in the following exception, which is the right behavior:
// java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Column id
//   at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
//   at org.apache.spark.sql.Column.getItem(Column.scala:856)
//   ... 49 elided
df.withColumn("mapped", map_col.getItem(col("id"))).show
```

### Does this PR introduce any user-facing change?

Yes. If the user wants to pass a `Column` object to `getItem`, they now need to use the indexing operator to achieve the previous behavior.

```Python
df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col[col('id')]).show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+
```

### How was this patch tested?

Existing tests.

Closes apache#26351 from imback82/spark-29664.

Authored-by: Terry Kim <[email protected]>

Signed-off-by: HyukjinKwon <[email protected]>
1 parent: 8a8ac00

3 files changed: +21 −8 lines

docs/pyspark-migration-guide.md

Lines changed: 3 additions & 0 deletions
```diff
@@ -84,6 +84,9 @@ Please refer [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.
 
 - Since Spark 3.0, `createDataFrame(..., verifySchema=True)` validates `LongType` as well in PySpark. Previously, `LongType` was not verified and resulted in `None` in case the value overflows. To restore this behavior, `verifySchema` can be set to `False` to disable the validation.
 
+- Since Spark 3.0, `Column.getItem` is fixed such that it does not call `Column.apply`. Consequently, if `Column` is used as an argument to `getItem`, the indexing operator should be used.
+  For example, `map_col.getItem(col('id'))` should be replaced with `map_col[col('id')]`.
+
 ## Upgrading from PySpark 2.3 to 2.4
 
 - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`.
```
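To make the new migration note concrete, here is a small self-contained sketch based on the examples in the commit message (the local `SparkSession` setup is illustrative, not part of the patch):

```Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, create_map, lit

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))

# Before Spark 3.0 this resolved the key per row; from 3.0 on it raises
# Py4JJavaError ("Unsupported literal type class org.apache.spark.sql.Column"):
# df.withColumn("mapped", map_col.getItem(col("id"))).show()

# The indexing operator preserves the old lookup-by-column behavior:
df.withColumn("mapped", map_col[col("id")]).show()
# +---+------+
# | id|mapped|
# +---+------+
# |  0|   100|
# |  1|   200|
# +---+------+
```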

python/pyspark/sql/column.py

Lines changed: 5 additions & 7 deletions
```diff
@@ -296,14 +296,12 @@ def getItem(self, key):
         +----+------+
         |   1| value|
         +----+------+
-        >>> df.select(df.l[0], df.d["key"]).show()
-        +----+------+
-        |l[0]|d[key]|
-        +----+------+
-        |   1| value|
-        +----+------+
+
+        .. versionchanged:: 3.0
+           If `key` is a `Column` object, the indexing operator should be used instead.
+           For example, `map_col.getItem(col('id'))` should be replaced with `map_col[col('id')]`.
         """
-        return self[key]
+        return _bin_op("getItem")(self, key)
 
     @since(1.3)
     def getField(self, name):
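```

Why this one-line change fixes the inconsistency: the old `return self[key]` routed through `__getitem__`, which forwards to the JVM `Column.apply`, whereas `_bin_op("getItem")` forwards to the JVM `Column.getItem`, which wraps its argument in a `Literal` and therefore rejects a `Column`. A simplified sketch of the `_bin_op` helper in `pyspark/sql/column.py` (paraphrased from the PySpark source of this era, not part of this diff):

```Python
def _bin_op(name, doc="binary operator"):
    """Create a Column method that forwards to the JVM Column method `name`."""
    def _(self, other):
        # Unwrap a PySpark Column to its JVM counterpart; plain Python
        # values (ints, strings, ...) are passed through unchanged.
        jc = other._jc if isinstance(other, Column) else other
        njc = getattr(self._jc, name)(jc)
        return Column(njc)
    _.__doc__ = doc
    return _
```

With `name="getItem"`, the key lands in Scala's `Column.getItem`, which calls `Literal(key)` and, for a `Column` argument, throws the `Unsupported literal type` error shown in the commit message; this matches the Scala behavior the PR aligns with.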

python/pyspark/sql/tests/test_column.py

Lines changed: 13 additions & 1 deletion
```diff
@@ -18,6 +18,8 @@
 
 import sys
 
+from py4j.protocol import Py4JJavaError
+
 from pyspark.sql import Column, Row
 from pyspark.sql.types import *
 from pyspark.sql.utils import AnalysisException
@@ -85,14 +87,24 @@ def test_column_operators(self):
             "Cannot apply 'in' operator against a column",
             lambda: 1 in cs)
 
-    def test_column_getitem(self):
+    def test_column_apply(self):
         from pyspark.sql.functions import col
 
         self.assertIsInstance(col("foo")[1:3], Column)
         self.assertIsInstance(col("foo")[0], Column)
         self.assertIsInstance(col("foo")["bar"], Column)
         self.assertRaises(ValueError, lambda: col("foo")[0:10:2])
 
+    def test_column_getitem(self):
+        from pyspark.sql.functions import col, create_map, lit
+
+        map_col = create_map(lit(0), lit(100), lit(1), lit(200))
+        self.assertRaisesRegexp(
+            Py4JJavaError,
+            "Unsupported literal type class org.apache.spark.sql.Column id",
+            lambda: map_col.getItem(col('id'))
+        )
+
     def test_column_select(self):
         df = self.df
         self.assertEqual(self.testData, df.select("*").collect())
```