Description
Problem:
I batch insert into ClickHouse using the Spark API Dataset.write.jdbc(). When the Spark job fails (including task failures, whole-job failures, and executor loss), the data in ClickHouse is incorrect, usually containing more rows than the source data, because of Spark's retry mechanism.
The code looks like this:
Dataset df = toclickhouse.spark.sql(" my query sql ").as(encoder);
df.write()
    .mode("append")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .jdbc(Config.jdbcUrlAB, "clickhouse_table", Config.ckPropertiesAB);
Config.ckPropertiesAB is defined as:
public static Properties ckPropertiesAB = new Properties();
static {
    ckPropertiesAB.put("user", "xxxx");
    ckPropertiesAB.put("password", "xxxx");
    ckPropertiesAB.put("batchsize", "200000");
    ckPropertiesAB.put("rewriteBatchedStatements", "true");
}
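As far as I understand, for each partition Spark's JDBC writer effectively does something like the simplified sketch below (not the real Spark source; the two-column schema is made up). Since there is no transaction covering the whole partition over the ClickHouse HTTP interface, every executeBatch() that succeeded before a task failure is already persisted, and the retried task inserts those rows again:

// Simplified sketch of what Spark's JDBC writer does per partition
// (not the actual Spark source; the two-column schema is made up).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Iterator;
import java.util.Properties;

public class PartitionWriteSketch {
    static void savePartition(String url, Properties props,
                              Iterator<Object[]> rows, int batchSize) throws Exception {
        try (Connection conn = DriverManager.getConnection(url, props);
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO clickhouse_table VALUES (?, ?)")) {
            int count = 0;
            while (rows.hasNext()) {
                Object[] row = rows.next();
                stmt.setObject(1, row[0]);
                stmt.setObject(2, row[1]);
                stmt.addBatch();
                if (++count % batchSize == 0) {
                    stmt.executeBatch();  // already visible in ClickHouse; a later
                }                         // failure + retry writes these rows twice
            }
            stmt.executeBatch();          // flush the last partial batch
        }
    }
}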
clickhouse_table engine: ReplicatedMergeTree
JDBC driver:
<dependency>
    <groupId>com.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.3.2-patch8</version>
    <classifier>http</classifier>
    <exclusions>
        <exclusion>
            <groupId>*</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
When some task in the Spark job fails, the data in the ClickHouse table is wrong: batches written by the failed attempt remain, and the retried task inserts them again.
Then I found that ClickHouse has limited support for transactions in MergeTree tables:
ClickHouse/ClickHouse#22086
If the large INSERT ran in a single transaction, my problem would be solved perfectly.
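This is the behaviour I am hoping for, sketched as plain JDBC (host, database, schema, and row count below are placeholders; whether clickhouse-jdbc 0.3.2 actually supports setAutoCommit(false) / commit() against a MergeTree table this way is exactly what I could not confirm):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SingleTransactionSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:clickhouse://host:8123/db", "xxxx", "xxxx")) {
            conn.setAutoCommit(false);   // one transaction for the whole insert
            try (PreparedStatement stmt = conn.prepareStatement(
                    "INSERT INTO clickhouse_table VALUES (?, ?)")) {
                for (long i = 0; i < 1_000_000L; i++) {
                    stmt.setLong(1, i);
                    stmt.setString(2, "value-" + i);
                    stmt.addBatch();
                    if ((i + 1) % 200_000 == 0) {
                        stmt.executeBatch(); // with real transaction support these rows
                                             // would stay invisible until commit()
                    }
                }
                stmt.executeBatch();
                conn.commit();               // everything becomes visible atomically
            } catch (Exception e) {
                conn.rollback();             // and nothing persists on failure
                throw e;
            }
        }
    }
}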
But I could not find how to use this feature in clickhouse-jdbc, or how to use it with the Spark API Dataset.write.jdbc(). Is it configured through the properties passed to Dataset.write.jdbc(url, table, properties)?
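Concretely, is the intended configuration something like the snippet below (reusing df and Config from above)? As far as I know, Spark forwards extra entries in the Properties object to the driver as connection properties, and Spark also has an isolationLevel write option (default READ_UNCOMMITTED) that it only applies when the driver reports transaction support; whether clickhouse-jdbc 0.3.2 honors any of this is exactly what I am unsure about.

// Hypothetical configuration, only to illustrate the question.
Properties props = new Properties();
props.put("user", "xxxx");
props.put("password", "xxxx");
props.put("batchsize", "200000");
props.put("isolationLevel", "READ_COMMITTED"); // Spark-side option; only takes effect
                                               // if the driver reports transaction support

df.write()
    .mode("append")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .jdbc(Config.jdbcUrlAB, "clickhouse_table", props);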