How to enable transactions in large insert via spark #950

Closed
chenrun0210 opened this issue Jun 9, 2022 · 4 comments · Fixed by #1008

Comments

@chenrun0210

chenrun0210 commented Jun 9, 2022

Problem:
I batch insert into ClickHouse using the Spark API Dataset.write().jdbc(). When the Spark job fails (a task fails, the whole job fails, or an executor is lost), the data in ClickHouse is not correct, usually containing more rows than the source data because of Spark's retry mechanism.
The code looks like this:
Dataset df = toclickhouse.spark.sql("my query sql").as(encoder);
df.write()
    .mode("append")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .jdbc(Config.jdbcUrlAB, "clickhouse_table", Config.ckPropertiesAB);

Config.ckPropertiesAB looks like this:
public static Properties ckPropertiesAB = new Properties();
static {
    ckPropertiesAB.put("user", "xxxx");
    ckPropertiesAB.put("password", "xxxx");
    ckPropertiesAB.put("batchsize", "200000");
    ckPropertiesAB.put("rewriteBatchedStatements", "true");
}

clickhouse_table engine: ReplicatedMergeTree
JDBC driver:
    <dependency>
        <groupId>com.clickhouse</groupId>
        <artifactId>clickhouse-jdbc</artifactId>
        <version>0.3.2-patch8</version>
        <classifier>http</classifier>
        <exclusions>
            <exclusion>
                <groupId>*</groupId>
                <artifactId>*</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

When some task in the Spark job fails, the data in the ClickHouse table ends up wrong.

Then I found that ClickHouse has limited support for transactions in MergeTree tables:
ClickHouse/ClickHouse#22086

If a large INSERT could run in a single transaction, my problem would be solved perfectly.
But I couldn't find how to use this feature in clickhouse-jdbc, or how to use it with the Spark API Dataset.write.jdbc(). Is it configured through the properties passed to Dataset.write.jdbc(url, table, properties)?

@zhicwu
Contributor

zhicwu commented Jun 10, 2022

Hi @chenrun0210, sorry, I'm not familiar with that feature. Is it just about adding new settings for each insert? Can you share more details?

The JDBC driver has a fake transaction built in for JDBC compliance, and I should be able to enhance it for atomic insertion as well.

@chenrun0210
Author

I'm sorry, I wrote the wrong link. The right link is:
ClickHouse/ClickHouse#22086
Related PR:
ClickHouse/ClickHouse#24258

The JDBC driver has a fake transaction built in for JDBC compliance, and I should be able to enhance it for atomic insertion as well.

-- "has a fake transaction built in for JDBC compliance": does this mean the transaction configuration in clickhouse-jdbc actually does nothing? If it does work, how do I configure it?

@zhicwu
Contributor

zhicwu commented Jun 10, 2022

I see. Yes, as of now it's not supported. However, since the JDBC driver supports executing multiple statements in one go, a workaround would be to execute the TCL statements manually, together with the insertion. For example:

begin transaction;
insert into mytable values (1),(10);
commit;
-- execute rollback when there's exception
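
In plain JDBC terms, that workaround could look roughly like the sketch below. This is only an illustration, not a verified recipe: the URL, credentials, and table name are placeholders, and it assumes the server actually has the experimental MergeTree transaction support enabled.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ManualTransactionInsert {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; replace with your own.
        String url = "jdbc:clickhouse://localhost:8123/default";
        try (Connection conn = DriverManager.getConnection(url, "default", "");
             Statement stmt = conn.createStatement()) {
            try {
                // The statements are sent one by one here; per the comment above,
                // the driver can also execute them together in one multi-statement call.
                stmt.execute("begin transaction");
                stmt.execute("insert into mytable values (1),(10)");
                stmt.execute("commit");
            } catch (Exception e) {
                // Undo the whole insert if anything goes wrong.
                stmt.execute("rollback");
                throw e;
            }
        }
    }
}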

Another workaround would be using clickhouse-local to generate parts, copy them to the ClickHouse server, and attach them as needed.

@zhicwu zhicwu added this to the 0.3.3 release milestone Jun 26, 2022
@zhicwu zhicwu modified the milestones: 0.3.3 release, 0.3.2-patch11 Jul 15, 2022
@zhicwu zhicwu linked a pull request Jul 24, 2022 that will close this issue
@zhicwu
Contributor

zhicwu commented Jul 28, 2022

If you can upgrade ClickHouse to 22.7, you don't have to upgrade the JDBC driver; simply add the connection property custom_http_headers=implicit_transaction%3D1 to enable implicit transactions.
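
With the setup from the original post, that might look something like the sketch below. This is a guess at the wiring rather than a verified configuration: the property is shown appended to the JDBC URL (the %3D is the URL-encoded '=' inside the header value); it could also be placed into Config.ckPropertiesAB instead.

// Sketch: enable implicit transactions via a connection property (ClickHouse 22.7+).
// Use '&' instead of '?' if Config.jdbcUrlAB already carries query parameters.
String urlWithImplicitTx = Config.jdbcUrlAB + "?custom_http_headers=implicit_transaction%3D1";

df.write()
  .mode("append")
  .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
  .jdbc(urlWithImplicitTx, "clickhouse_table", Config.ckPropertiesAB);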

@zhicwu zhicwu closed this as completed Jul 28, 2022