
Commit 94e3509

drphrozenistreeter authored and committed
Parquet files with different schemas fail in Databricks loader (close #1085)
We hit an issue when reading data from multiple Parquet files with different schemas (optional columns exist only in some of the files). It raises the following exception in Databricks: `com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: [MISSING_COLUMN] Column 'unstruct_event_com_lego_3dcatalogue_like_product_1' does not exist. Did you mean one of the following?` Recreating the issue in a Databricks notebook and testing different options showed that we had to add `FORMAT_OPTIONS` with `mergeSchema` to fix it.
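For illustration, the generated `COPY INTO` statement with the fix applied looks roughly like the sketch below. Table name, column list, and path are hypothetical placeholders; only the `FORMAT_OPTIONS` / `COPY_OPTIONS` lines reflect the actual change. `FORMAT_OPTIONS('MERGESCHEMA' = 'TRUE')` tells the Parquet reader to union the schemas of all files it reads, while `COPY_OPTIONS('MERGESCHEMA' = 'TRUE')` additionally allows the target table's schema to evolve.

```sql
COPY INTO atomic.events              -- hypothetical target table
FROM (
  SELECT event_id, app_id           -- hypothetical column list
  FROM 's3://bucket/transformed/'   -- hypothetical source path
)
FILEFORMAT = PARQUET
FORMAT_OPTIONS('MERGESCHEMA' = 'TRUE')
COPY_OPTIONS('MERGESCHEMA' = 'TRUE')
```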
1 parent 1cc45f9 commit 94e3509

File tree

1 file changed: +1 −0 lines changed
  • modules/databricks-loader/src/main/scala/com/snowplowanalytics/snowplow/loader/databricks


modules/databricks-loader/src/main/scala/com/snowplowanalytics/snowplow/loader/databricks/Databricks.scala

Lines changed: 1 addition & 0 deletions
@@ -111,6 +111,7 @@ object Databricks {
           SELECT $frSelectColumns from '$frPath' $frAuth
         )
         FILEFORMAT = PARQUET
+        FORMAT_OPTIONS('MERGESCHEMA' = 'TRUE')
         COPY_OPTIONS('MERGESCHEMA' = 'TRUE')""";
     case _: Statement.ShreddedCopy =>
       throw new IllegalStateException("Databricks Loader does not support migrations")
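Conceptually, what `mergeSchema` does for the failing case is take the union of the column sets across all input files and fill columns absent from a given file with nulls. A minimal stdlib Python sketch of that behavior (names hypothetical, not the loader's code, with lists of dicts standing in for Parquet files):

```python
def merge_schemas(batches):
    """Union the column sets of several record batches, filling columns
    that are absent in a batch with None -- roughly what Spark's
    mergeSchema option does for optional Parquet columns."""
    columns = []
    for batch in batches:
        for row in batch:
            for col in row:
                if col not in columns:
                    columns.append(col)
    return [{col: row.get(col) for col in columns}
            for batch in batches for row in batch]


# Two "files": the second lacks the optional unstruct_event column,
# mirroring the MISSING_COLUMN failure described in the commit message.
file_a = [{"event_id": "1", "unstruct_event_like_product_1": {"liked": True}}]
file_b = [{"event_id": "2"}]
merged = merge_schemas([file_a, file_b])
```

Without the merge step, selecting the optional column against the second file's schema is what triggers the `MISSING_COLUMN` AnalysisException.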

0 commit comments