Possible way of inserting a lot of CSV files #1563
Comments
I would pass CSV files as a CSV stream. See the examples in #909. |
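(For context, a minimal sketch of what such a stream-based insert can look like with the Java client; the node URL, table name, and file name are illustrative, and import packages may differ slightly between client versions.)

```java
import com.clickhouse.client.ClickHouseClient;
import com.clickhouse.client.ClickHouseNode;
import com.clickhouse.client.ClickHouseResponse;
import com.clickhouse.data.ClickHouseFormat; // com.clickhouse.client.ClickHouseFormat in older releases

import java.io.FileInputStream;
import java.io.InputStream;

public class CsvStreamInsertExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical node; replace with your own host, port and database.
        ClickHouseNode server = ClickHouseNode.of("http://localhost:8123/default");
        try (ClickHouseClient client = ClickHouseClient.newInstance(server.getProtocol());
             InputStream csv = new FileInputStream("data.csv");
             ClickHouseResponse response = client
                     .write(server)                    // mutation request
                     .query("INSERT INTO my_table")    // hypothetical table
                     .format(ClickHouseFormat.CSV)     // body is raw CSV
                     .data(csv)                        // stream the file as the insert body
                     .executeAndWait()) {
            // the summary reports rows/bytes written as seen by the server
            System.out.println(response.getSummary());
        }
    }
}
```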
Thanks, I'll try that in a few days! |
@den-crane, thanks a lot! That works reasonably well. However, when I try to insert a really big CSV file (12 million rows, about 1 GB), I end up inserting only about 1 million rows (those insert in less than a minute, I presume) and then after 5 minutes I get an execution timeout. Do you have any ideas how to fix it? I'm thinking about opening the file and splitting it into several separate files, with multithreading and partitions. |
First, check the time with clickhouse-client to get a baseline for such a file. |
Yes, I've done that before, sorry for not mentioning it. With the CLI, a 1 GB file with 12 million rows takes about 60 seconds. With JDBC, for example, it takes about 4 minutes to insert the same file. |
It's probably related to the network. Check how fast you can transfer such a file from the client to the server. |
I'm inserting the same file with clickhouse-client connected to the same server on ClickHouse Cloud, which is why I presume it's not a network problem. Or might it be, even though I'm connecting from the same machine but with different clients (the CLI and Java, I mean)? I'll try compression; is there any documentation I could read about using it? My CSV files are not compressed.

UPD: I'm enabling compression for my connection like this:

```java
.addOption(ClickHouseClientOption.COMPRESS.getKey(), "true")
.addOption(ClickHouseClientOption.COMPRESS_ALGORITHM.getKey(), "ZSTD")
```

But I got an exception that ZSTD is somehow not supported. I couldn't find anything related on Google, so I tried LZ4, but it's still really slow, around 40k rows/sec. Here are all my connection settings:

```java
public static ClickHouseNode initJavaClientConnection() {
  Properties properties = PropertiesLoader.loadJDBCProp();
  return ClickHouseNode.builder()
      .host(properties.getProperty("HOST"))
      .port(ClickHouseProtocol.HTTP, Integer.valueOf(properties.getProperty("PORT")))
      .database(properties.getProperty("DATABASE"))
      .credentials(
          ClickHouseCredentials.fromUserAndPassword(
              properties.getProperty("USERNAME"), properties.getProperty("PASSWORD")))
      .addOption(ClickHouseClientOption.SSL.getKey(), properties.getProperty("SSL"))
      .addOption(
          ClickHouseHttpOption.CUSTOM_PARAMS.getKey(), "async_insert=1, wait_for_async_insert=1")
      .addOption(ClickHouseClientOption.SOCKET_TIMEOUT.getKey(), "300000")
      .addOption(ClickHouseClientOption.MAX_EXECUTION_TIME.getKey(), "300")
      .addOption(ClickHouseClientOption.COMPRESS.getKey(), "true")
      .addOption(ClickHouseClientOption.COMPRESS_ALGORITHM.getKey(), "ZSTD")
      .build();
}
```

It makes no difference whether I use the LZ4 compression settings or not; the insert speed is still the same. This is my insertion code. Please excuse the lack of refactoring; my first goal is to get it working:

```java
public void readExecutor() {
  List<String> ticketNames = getFilesInDirectory();
  List<List<String>> ticketParts =
      Lists.partition(ticketNames, ticketNames.size() / PARTS_QUANTITY);
  ClickHouseNode server = ConnectionHandler.initJavaClientConnection();
  ClickHouseClient client = ClickHouseClient.newInstance(server.getProtocol());
  try (ExecutorService service = Executors.newFixedThreadPool(THREADS_COUNT)) {
    clickHouseDAO = new ClickHouseDAO(server);
    clickHouseDAO.truncateTable();
    for (List<String> ticketPartition : ticketParts) {
      service.execute(
          () -> {
            for (String fileName : ticketPartition) {
              Path filePath = Paths.get(SOURCE_PATH + "/" + fileName);
              try (ClickHouseResponse response =
                  client
                      .write(server)
                      .query("INSERT INTO tickets_data_db.tickets_data")
                      .format(ClickHouseFormat.CSV)
                      .data(filePath.toString())
                      .executeAndWait()) {
              } catch (ClickHouseException e) {
                throw new RuntimeException(e);
              }
            }
          });
    }
  } finally {
    client.close();
  }
}
```

I did it this way because I was trying to ensure that I use the same connection for all my threads. When I tried to open a connection for every thread, I was getting errors like |
clickhouse-connect uses LZ4 for communication with remote servers (not localhost). There are two different compressions. You have set LZ4 compression for the session -- query results and batched inserts. But you need to set the compression option for a particular
|
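(A sketch of what per-insert compression can look like, assuming the file on disk is already gzip-compressed and that `client` and `server` are set up as in the earlier snippets; the path is illustrative. The two-argument `data(file, compression)` overload is the one used later in this thread.)

```java
// Tell the client how the request body for this particular insert is compressed.
try (ClickHouseResponse response = client
        .write(server)
        .query("INSERT INTO tickets_data_db.tickets_data")
        .format(ClickHouseFormat.CSV)
        .data("/path/to/tickets.csv.gz", ClickHouseCompression.GZIP) // hypothetical path
        .executeAndWait()) {
    System.out.println(response.getSummary());
}
```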
So are you advising me to implement some stream that compresses the CSV files when passing them to the data function? Should it be a class derived from FileInputStream? |
Try to compress that file with gzip on the command line and then check whether it helps or not. |
All my files are stored just as plain files, without any extension. |
You have one big file xxxx, any file. |
I'm sorry for the inconvenience; have I got it right: you want me to compress my big file with gzip using any tool, and then upload it to the server with the Java client, also with some addition? I'm not aware of all the possibilities of the Java client for ClickHouse, because I couldn't find any docs about its classes and objects, only how to use the Java client for simple queries. |
All my files were compressed with Apache Commons Compress using LZ4, as it was quite fast to compress all of them.

```java
service.execute(
    () -> {
      for (String fileName : ticketPartition) {
        try (ClickHouseResponse response =
            client
                .write(server)
                .query("INSERT INTO tickets_data_db.tickets_data")
                .format(ClickHouseFormat.CSV)
                .data(fileName, ClickHouseCompression.LZ4)
                .executeAndWait()) {
        } catch (ClickHouseException e) {
          throw new RuntimeException(e);
        }
      }
    });
```

```
Caused by: com.clickhouse.client.ClickHouseException: Code: 432. DB::Exception: Unknown codec family code: 57. (UNKNOWN_CODEC) (version 24.1.2.10900 (official build))
```

I'm going to try compressing with ZSTD for now, but still, this is kinda strange. |
ClickHouse uses its own format for LZ4 blocks. |
So it's not possible to load files compressed with LZ4? I'll try another format then. |
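(One way around the codec mismatch, sketched below under the assumption that plain gzip framing is acceptable to the server: compress the CSV with the JDK's own gzip implementation instead of framed LZ4 from Commons Compress. The file handling is illustrative.)

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class GzipCsvExample {
    // Compresses a CSV file to <file>.gz using standard gzip framing.
    public static Path gzipFile(Path source) throws IOException {
        Path target = Paths.get(source.toString() + ".gz");
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
            in.transferTo(out); // Java 9+: copy the whole stream
        }
        return target;
    }
}
```

The resulting file could then be passed with the `data(file, ClickHouseCompression.GZIP)` overload shown earlier.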
Compressing files with
However, is there any way to reach a speed of about 200-250k rows/sec, like with the ClickHouse CLI, in the Java client? Maybe some more optimizations to apply?

```java
.addOption(ClickHouseHttpOption.CUSTOM_PARAMS.getKey(), "async_insert=1, wait_for_async_insert=1")
.addOption(ClickHouseClientOption.SOCKET_TIMEOUT.getKey(), "300000")
.addOption(ClickHouseClientOption.MAX_EXECUTION_TIME.getKey(), "300")
```

Async insert is enabled, but I think it's not necessary. I'm using one connection instance because it results in a more stable connection and no exceptions about read timeouts or "SSL peer shut down incorrectly". |
I want to do something like this:

```java
public void insertFromCompressedFileStream(PipedInputStream pin) {
  try (ClickHouseResponse response =
      client
          .write(server)
          .query("INSERT INTO tickets_data_db.tickets_data")
          .format(ClickHouseFormat.CSV)
          .data(pin, ClickHouseCompression.GZIP)
          .executeAndWait()) {
  } catch (ClickHouseException e) {
    throw new RuntimeException(e);
  }
}
```

However, as you know, there is no overload of the data function that accepts a stream as the first parameter and a compression as the second one, so with code like
Removed |
It seems it's not possible without changing the code of the HTTP client, though the needed change is fairly simple. But anyway, you clearly misunderstood me. I just wanted to pinpoint the bottleneck; clearly it's the network. It may be easier to investigate the issue using simple curl, not Java. Just take your biggest CSV file, try to load it using clickhouse-client, then compress it using ZSTD and try to load it using the curl example from https://clickhouse.com/docs/en/interfaces/http. Then compare the results. |
I don't want to use JDBC because I think it would be much slower than the Java client. I've tried to load the CSV files, not compressed, like this: #1402 (comment). Also, I'm really curious about the network issue, because I'm simply loading data from two different apps (ports, clients); that's why I'm wondering whether it can be a network problem, because it's just different clients while the network stays the same - both clients use clickhouse.cloud as the connection point. With a curl query like this: |
I've managed to do something like this:

```java
public void insertFromCompressedFileStream(PipedInputStream pin) {
  try (ClickHouseResponse response =
      client
          .write(server)
          .query("INSERT INTO tickets_data_db.tickets_data")
          .data(ClickHousePassThruStream.of(pin, ClickHouseCompression.GZIP, ClickHouseFormat.CSV))
          .executeAndWait()) {
  } catch (ClickHouseException e) {
    try {
      pin.close();
    } catch (IOException ex) {
      // TODO: add logging
    }
    throw new RuntimeException(e);
  }
}
```

It reached a speed of about 350k rows/sec, so almost like clickhouse-client. Data from
Also, I think the docs should mention using an input stream with the compression and format specified. So maybe we could add it someday. |
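(For completeness, a sketch of what the producing side of that pipe might look like; the snippet above only shows the consuming end. The buffer size, executor, and file handling are illustrative assumptions.)

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.zip.GZIPOutputStream;

public class PipeProducerExample {
    // Feeds one CSV file, gzip-compressed on the fly, into a pipe that
    // insertFromCompressedFileStream(pin) can read from on another thread.
    public static PipedInputStream openCompressedPipe(Path csvFile, ExecutorService pool) throws IOException {
        PipedOutputStream pout = new PipedOutputStream();
        PipedInputStream pin = new PipedInputStream(pout, 64 * 1024); // 64 KiB pipe buffer
        pool.execute(() -> {
            try (GZIPOutputStream gz = new GZIPOutputStream(pout)) {
                Files.copy(csvFile, gz); // write the raw CSV through the gzip stream
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        return pin;
    }
}
```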
Hello, my use case is this:
I have a directory with a lot of CSV files, about 1.2k of them; each contains from ~10k to ~10-20 million rows.
I want to insert data from these files into my ClickHouse DB as fast as possible, using the Java client. (I've tried JDBC and reached a speed of only ~50-70k rows/sec, which results in a 10-minute insertion of 30 million rows.)
Sure, I can read this directory with many threads at once, so each thread takes only one chunk of files from the directory, but performance with JDBC is still quite slow, whereas the CLI client is able to load the same directory of 1.2k files with 30 million rows in total in 57 seconds.
I'm aware that with the Java client I could pass my data as a binary stream, which might improve performance. However, after some research on the web and on this issues page, I'm still getting an error that I'm not quite sure how to fix:
The error is: ClickHouseException: Code: 159. Execution timed out, server ClickHouseNode
So I only managed to insert 700k rows out of a possible 50 million.
```java
addOption(ClickHouseClientOption.SOCKET_TIMEOUT.getKey(), "60000")
```

That doesn't help much; I still get a timeout after 1 minute. I've tried something like this: ClickHouse Example, but I still got a timeout exception and no data was loaded at all.
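(If it helps, a sketch of raising both the client-side socket timeout and the server-side execution limit together, using the same options that appear elsewhere in this thread; the host and values are illustrative.)

```java
ClickHouseNode server = ClickHouseNode.builder()
        .host("my-instance.clickhouse.cloud")  // hypothetical host
        .port(ClickHouseProtocol.HTTP, 8443)
        // client side: how long the socket may stay idle, in milliseconds
        .addOption(ClickHouseClientOption.SOCKET_TIMEOUT.getKey(), "300000")
        // server side: max_execution_time for the query, in seconds
        .addOption(ClickHouseClientOption.MAX_EXECUTION_TIME.getKey(), "300")
        .build();
```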
This is sample code, where `List<String> data` contains all lines from a CSV file, so one String from this List is one record. I'm thinking that, obviously, there might be some error with the request & response variables, especially with their initialization.
Also, as you understand, my table contains columns of different types.
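(Since the sample code itself isn't shown, here is a sketch of one way such a `List<String>` of CSV lines could be passed as the insert body with the Java client; the table name is illustrative and `client`/`server` are assumed to be initialised as in the comments above. For very large files, streaming straight from disk, as in the working example above, avoids holding everything in memory.)

```java
// Joins the CSV lines into one payload and streams it as the request body.
public void insertLines(ClickHouseClient client, ClickHouseNode server, List<String> data)
        throws ClickHouseException {
    byte[] payload = String.join("\n", data).getBytes(StandardCharsets.UTF_8);
    try (ClickHouseResponse response = client
            .write(server)
            .query("INSERT INTO tickets_data_db.tickets_data")
            .format(ClickHouseFormat.CSV)
            .data(new ByteArrayInputStream(payload))
            .executeAndWait()) {
        // nothing to read back for an insert; closing the response releases the connection
    }
}
```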