
Commit 94e3509

drphrozenistreeter authored and committed
Parquet files with different schemas fail in Databricks loader (close #1085)
We hit an issue when reading data from multiple Parquet files with different schemas (optional columns exist only in some of the files). It raises the following exception in Databricks: `com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: [MISSING_COLUMN] Column 'unstruct_event_com_lego_3dcatalogue_like_product_1' does not exist. Did you mean one of the following?` Recreating the issue in a Databricks notebook and testing different options showed that we had to add `FORMAT_OPTIONS` with `mergeSchema` to fix it.
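For illustration, the generated `COPY INTO` statement with the fix applied looks roughly like the sketch below. Table name, column list, and path are hypothetical placeholders; only the `FORMAT_OPTIONS` / `COPY_OPTIONS` lines reflect the actual change. `FORMAT_OPTIONS('MERGESCHEMA' = 'TRUE')` tells the Parquet reader to union the schemas of all files it reads, while `COPY_OPTIONS('MERGESCHEMA' = 'TRUE')` additionally allows the target table's schema to evolve.

```sql
COPY INTO atomic.events              -- hypothetical target table
FROM (
  SELECT event_id, app_id           -- hypothetical column list
  FROM 's3://bucket/transformed/'   -- hypothetical source path
)
FILEFORMAT = PARQUET
FORMAT_OPTIONS('MERGESCHEMA' = 'TRUE')
COPY_OPTIONS('MERGESCHEMA' = 'TRUE')
```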
1 parent 1cc45f9 commit 94e3509

File tree

1 file changed: +1 −0 lines changed
  • modules/databricks-loader/src/main/scala/com/snowplowanalytics/snowplow/loader/databricks


modules/databricks-loader/src/main/scala/com/snowplowanalytics/snowplow/loader/databricks/Databricks.scala

Lines changed: 1 addition & 0 deletions
@@ -111,6 +111,7 @@ object Databricks {
           SELECT $frSelectColumns from '$frPath' $frAuth
         )
         FILEFORMAT = PARQUET
+        FORMAT_OPTIONS('MERGESCHEMA' = 'TRUE')
         COPY_OPTIONS('MERGESCHEMA' = 'TRUE')""";
     case _: Statement.ShreddedCopy =>
       throw new IllegalStateException("Databricks Loader does not support migrations")
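Conceptually, what `mergeSchema` does for the failing case is take the union of the column sets across all input files and fill columns absent from a given file with nulls. A minimal stdlib Python sketch of that behavior (names hypothetical, not the loader's code, with lists of dicts standing in for Parquet files):

```python
def merge_schemas(batches):
    """Union the column sets of several record batches, filling columns
    that are absent in a batch with None -- roughly what Spark's
    mergeSchema option does for optional Parquet columns."""
    columns = []
    for batch in batches:
        for row in batch:
            for col in row:
                if col not in columns:
                    columns.append(col)
    return [{col: row.get(col) for col in columns}
            for batch in batches for row in batch]


# Two "files": the second lacks the optional unstruct_event column,
# mirroring the MISSING_COLUMN failure described in the commit message.
file_a = [{"event_id": "1", "unstruct_event_like_product_1": {"liked": True}}]
file_b = [{"event_id": "2"}]
merged = merge_schemas([file_a, file_b])
```

Without the merge step, selecting the optional column against the second file's schema is what triggers the `MISSING_COLUMN` AnalysisException.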

0 commit comments