Description
Problem:
I batch insert into ClickHouse using the Spark API Dataset.write.jdbc(). When the Spark job fails (including task failures, whole-job failures, and executor loss), the data in ClickHouse is incorrect, usually containing more rows than the source data, because of Spark's retry mechanism.
The code looks like this:
Dataset df = toclickhouse.spark.sql(" my query sql ").as(encoder);
df.write()
    .mode("append")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .jdbc(Config.jdbcUrlAB, "clickhouse_table", Config.ckPropertiesAB);
Config.ckPropertiesAB is defined as:
public static Properties ckPropertiesAB = new Properties();
static {
    ckPropertiesAB.put("user", "xxxx");
    ckPropertiesAB.put("password", "xxxx");
    ckPropertiesAB.put("batchsize", "200000");
    ckPropertiesAB.put("rewriteBatchedStatements", "true");
}
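As far as I understand, for each partition Spark's JDBC writer effectively does something like the simplified sketch below (not the real Spark source; the two-column schema is made up). Since there is no transaction covering the whole partition over the ClickHouse HTTP interface, every executeBatch() that succeeded before a task failure is already persisted, and the retried task inserts those rows again:

// Simplified sketch of what Spark's JDBC writer does per partition
// (not the actual Spark source; the two-column schema is made up).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Iterator;
import java.util.Properties;

public class PartitionWriteSketch {
    static void savePartition(String url, Properties props,
                              Iterator<Object[]> rows, int batchSize) throws Exception {
        try (Connection conn = DriverManager.getConnection(url, props);
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO clickhouse_table VALUES (?, ?)")) {
            int count = 0;
            while (rows.hasNext()) {
                Object[] row = rows.next();
                stmt.setObject(1, row[0]);
                stmt.setObject(2, row[1]);
                stmt.addBatch();
                if (++count % batchSize == 0) {
                    stmt.executeBatch();  // already visible in ClickHouse; a later
                }                         // failure + retry writes these rows twice
            }
            stmt.executeBatch();          // flush the last partial batch
        }
    }
}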
clickhouse_table engine: ReplicatedMergeTree
JDBC driver:
<dependency>
    <groupId>com.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.3.2-patch8</version>
    <classifier>http</classifier>
    <exclusions>
        <exclusion>
            <groupId>*</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
When some task in the Spark job fails, the data in the ClickHouse table is wrong: batches written by the failed attempt remain, and the retried task inserts them again.
Then I found that ClickHouse has limited support for transactions in MergeTree tables:
ClickHouse/ClickHouse#22086
If the large INSERT ran in a single transaction, my problem would be solved perfectly.
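This is the behaviour I am hoping for, sketched as plain JDBC (host, database, schema, and row count below are placeholders; whether clickhouse-jdbc 0.3.2 actually supports setAutoCommit(false) / commit() against a MergeTree table this way is exactly what I could not confirm):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SingleTransactionSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:clickhouse://host:8123/db", "xxxx", "xxxx")) {
            conn.setAutoCommit(false);   // one transaction for the whole insert
            try (PreparedStatement stmt = conn.prepareStatement(
                    "INSERT INTO clickhouse_table VALUES (?, ?)")) {
                for (long i = 0; i < 1_000_000L; i++) {
                    stmt.setLong(1, i);
                    stmt.setString(2, "value-" + i);
                    stmt.addBatch();
                    if ((i + 1) % 200_000 == 0) {
                        stmt.executeBatch(); // with real transaction support these rows
                                             // would stay invisible until commit()
                    }
                }
                stmt.executeBatch();
                conn.commit();               // everything becomes visible atomically
            } catch (Exception e) {
                conn.rollback();             // and nothing persists on failure
                throw e;
            }
        }
    }
}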
But I could not find how to use this feature in clickhouse-jdbc, or how to use it with the Spark API Dataset.write.jdbc(). Is it configured through the properties passed to Dataset.write.jdbc(url, table, properties)?
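Concretely, is the intended configuration something like the snippet below (reusing df and Config from above)? As far as I know, Spark forwards extra entries in the Properties object to the driver as connection properties, and Spark also has an isolationLevel write option (default READ_UNCOMMITTED) that it only applies when the driver reports transaction support; whether clickhouse-jdbc 0.3.2 honors any of this is exactly what I am unsure about.

// Hypothetical configuration, only to illustrate the question.
Properties props = new Properties();
props.put("user", "xxxx");
props.put("password", "xxxx");
props.put("batchsize", "200000");
props.put("isolationLevel", "READ_COMMITTED"); // Spark-side option; only takes effect
                                               // if the driver reports transaction support

df.write()
    .mode("append")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .jdbc(Config.jdbcUrlAB, "clickhouse_table", props);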