[NSE-1161] Support read-write parquet conversion to read-write arrow #1162
Conversation
It seems we already have an experimental feature to overwrite vanilla Spark's …
Thanks for your attention. And yes, they all can convert parquet to the arrow file format, except that …
cc @zhouyuan. All suite tests have passed.
Running the full UT produced some failures. I extracted some of them and posted them here.
@PHILO-HE Thanks for the double check, I will look into these tests.
👍
.get("compression") | ||
.orElse(parquetCompressionConf) | ||
.getOrElse(session.sessionState.conf.parquetCompressionCodec) | ||
.toLowerCase(Locale.ROOT) |
Compression is supported in this PR. Maybe merge that one first, and then I can remove this code?
#1014
got it, thanks
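For context, here is a hedged sketch of how the codec-resolution chain quoted in the review above might look in full. Only the four chained calls come from the diff; the enclosing method, its parameters, and the way `parquetCompressionConf` is derived are assumptions made for illustration, not the actual code in this PR.

```scala
import java.util.Locale

import org.apache.spark.sql.SparkSession

// Hypothetical helper illustrating the codec-resolution chain from the review snippet.
// Only the four chained calls are taken from the diff; everything else is assumed.
def resolveWriteCodec(session: SparkSession, options: Map[String, String]): String = {
  // Assumed mirror of the parquet-specific writer option, e.g. "parquet.compression" -> "zstd".
  val parquetCompressionConf = options.get("parquet.compression")
  options
    .get("compression")                                              // generic writer option first
    .orElse(parquetCompressionConf)                                   // then the parquet-specific option
    .getOrElse(session.sessionState.conf.parquetCompressionCodec)     // finally the session default
    .toLowerCase(Locale.ROOT)                                         // normalize, e.g. "ZSTD" -> "zstd"
}
```

Read this way, an explicit `compression` writer option wins, then `parquet.compression`, and finally the session-wide parquet codec default.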
…ap-project#1162)

* add ArrowConvertExtension
* do not convert parquet fileformat while writing to partitioned/bucketed/sorted output
* fix cache failed
* care about write codec
* disable convertor extension by default
* add some comments
* [NSE-1170] Set correct row number in batch scan w/ partition columns (#1172)
* [NSE-1171] Throw RuntimeException when reading duplicate fields in case-insensitive mode (#1173)
  * throw exception if one more columns matched in case insensitive mode
  * add schema check in arrow v2
* bump h2/pgsql version (#1176)
  * bump h2/pgsql version
  * ignore one failed test
  Signed-off-by: Yuan Zhou <[email protected]>
* [NSE-956] allow to write parquet with compression (#1014)
  This patch adds support for writing parquet with compression:
  df.coalesce(1).write.format("arrow").option("parquet.compression","zstd").save(path)
  Signed-off-by: Yuan Zhou <[email protected]>
* [NSE-1161] Support read-write parquet conversion to read-write arrow (#1162)
  * add ArrowConvertExtension
  * do not convert parquet fileformat while writing to partitioned/bucketed/sorted output
  * fix cache failed
  * care about write codec
  * disable convertor extension by default
  * add some comments
* remove wrong compress type check (#1178)
  Since compression has been supported in #1014, the extra compression check in ArrowConvertorExtension can be removed now.
* fix to use right arrow branch (#1179)
  Signed-off-by: Yuan Zhou <[email protected]>
* [NSE-1171] Support merge parquet schema and read missing schema (#1175)
  * Support merge parquet schema and read missing schema
  * fix error
  * optimize null vectors
  * optimize code
  * change code
  * add schema merge suite tests
  * add test for struct type
* to use 1.5 branch arrow
  Signed-off-by: Yuan Zhou <[email protected]>

Co-authored-by: Jacky Lee <[email protected]>
What changes were proposed in this pull request?
In our scenario, users usually read or write Hive parquet tables, which are different from arrow datasource tables, so we need a conversion rule that converts ParquetFileFormat to ArrowFileFormat.
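For readers unfamiliar with Spark session extensions, below is a minimal, hedged sketch of what such a conversion rule could look like. The class names, the import path of ArrowFileFormat, and the exact pattern match are assumptions rather than the code of this PR; per the commit notes, the real ArrowConvertExtension also skips partitioned/bucketed/sorted writes and is disabled by default.

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
// Assumed import path; the actual location of ArrowFileFormat in this repo may differ.
import com.intel.oap.spark.sql.execution.datasources.arrow.ArrowFileFormat

// Hypothetical rule: rewrite parquet-backed file relations to use ArrowFileFormat.
case class ArrowConvertorRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
    case l @ LogicalRelation(
          r @ HadoopFsRelation(_, _, _, _, _: ParquetFileFormat, _), _, _, _) =>
      // Swap only the file format; location, schema, and options are kept as-is.
      l.copy(relation = r.copy(fileFormat = new ArrowFileFormat)(session))
  }
}

// Hypothetical entry point registered as a session extension.
class ArrowConvertExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectResolutionRule(ArrowConvertorRule)
  }
}
```

If wired up this way, the rule would be registered through Spark's standard `spark.sql.extensions` mechanism; the exact configuration key and gating flag used by this repo may differ.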
How was this patch tested?
Unit tests, and we have also used it in our own scenario.