
How to enable transactions for a large insert via Spark  #950

Closed
@chenrun0210

Description


Problem:
I batch insert into ClickHouse using the Spark API Dataset.write.jdbc(). When the Spark job fails (a task fails, the whole job fails, or an executor is lost), the data in ClickHouse is not correct, usually containing more rows than the source data because of Spark's retry mechanism.
The code looks like this:
Dataset df = toclickhouse.spark.sql("my query sql").as(encoder);
df.write()
  .mode("append")
  .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
  .jdbc(Config.jdbcUrlAB, "clickhouse_table", Config.ckPropertiesAB);

Config.ckPropertiesAB looks like this:
public static Properties ckPropertiesAB = new Properties();
static {
    ckPropertiesAB.put("user", "xxxx");
    ckPropertiesAB.put("password", "xxxx");
    ckPropertiesAB.put("batchsize", "200000");               // rows per JDBC batch
    ckPropertiesAB.put("rewriteBatchedStatements", "true");  // let the driver rewrite batched inserts
}

clickhouse_table engine: ReplicatedMergeTree
JDBC driver:
    <dependency>
        <groupId>com.clickhouse</groupId>
        <artifactId>clickhouse-jdbc</artifactId>
        <version>0.3.2-patch8</version>
        <classifier>http</classifier>
        <exclusions>
            <exclusion>
                <groupId>*</groupId>
                <artifactId>*</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

When some task in the Spark job fails, the data in the ClickHouse table is wrong (duplicate rows from retried tasks).

Then I found that ClickHouse has limited support for transactions in MergeTree tables:
ClickHouse/ClickHouse#22086

If a large INSERT ran in a single transaction, my problem would be solved perfectly.
But I could not find how to use this feature in clickhouse-jdbc, or how to use it through the Spark API Dataset.write.jdbc(). Is it configured via the properties passed to Dataset.write.jdbc(url, table, properties)?
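
For illustration, this is roughly what I am hoping the driver can do with plain JDBC. It is only a sketch under assumptions I have not confirmed: a transactionSupport connection property (which I believe newer clickhouse-jdbc releases expose as experimental), standard autoCommit/commit/rollback semantics, and hypothetical column names col1/col2:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Properties;

public class TxInsertSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "xxxx");
        props.put("password", "xxxx");
        // Assumption: this property enables driver-side transaction support.
        props.put("transactionSupport", "true");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:clickhouse://host:8123/default", props)) {
            conn.setAutoCommit(false); // start an explicit transaction
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO clickhouse_table (col1, col2) VALUES (?, ?)")) {
                for (int i = 0; i < 200000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "value-" + i);
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit();   // whole batch should become visible atomically
            } catch (Exception e) {
                conn.rollback(); // nothing should remain visible on failure
                throw e;
            }
        }
    }
}

If that is roughly right, then for Spark the remaining question is whether adding the same property to Config.ckPropertiesAB (e.g. ckPropertiesAB.put("transactionSupport", "true")) is enough, since Dataset.write.jdbc() passes those Properties through to the JDBC connection, or whether each Spark partition/batch would still commit separately.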
