Feature request: Rotation based on maximum file size on hdfs. #365


Open
TomLous opened this issue Aug 21, 2018 · 8 comments

@TomLous

TomLous commented Aug 21, 2018

I'd like the option to specify the maximum file size for the hdfs connector to write before rotating.
I understand the only way to do this is to approximate it by setting flushSize (based on # records) or time interval.
The reason is that it's very useful to keep files at approximately the HDFS block size, but no larger. This gives us more fine-grained control and the assurance that we won't end up with too-large files or, worse, many small files.
I'd like to add this feature to the codebase myself if possible. Are there any restrictions or guidelines I have to take into account if I want the possibility to merge these changes back into the codebase?
Or was this feature previously explored and abandoned for some reason?
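For context, these are the rotation controls the connector already exposes (record count and time based); the size-based setting below is hypothetical, just to illustrate the request:

```properties
# Existing rotation controls in kafka-connect-hdfs:
flush.size=100000            # rotate after this many records
rotate.interval.ms=600000    # rotate after this much wall-clock time

# Requested (does not exist yet) -- a hypothetical name might be:
# rotate.file.size.bytes=134217728   # ~128 MB, a common HDFS block size
```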

@OneCricketeer

OneCricketeer commented Sep 11, 2018

I think this would be a useful change, if handled appropriately. For example, a schema could change in the middle of a file (in any format, not necessarily Avro or Parquet). For text or JSON that might not be a big deal, but it would be annoying if, say, a process expected 8 CSV columns but only got 7.

Plus, time-based partitioning should not wait for the size limit to be reached before writing the data into the "truncated" datetime for a given partitioner

The "workarounds" are all downstream processors that I've enumerated in #271

@kkonstantine
Member

kkonstantine commented Sep 19, 2018

@TomLous this feature is useful indeed. It hasn't been rejected previously, as far as I know.
We'd need an elegant way to track bytes exported and when the limit is about to be reached (or has just been exceeded).

@Cricket007 the partitioning on size should not split records and should respect other partitioning criteria.
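The byte tracking kkonstantine mentions could be as simple as a counter the writer updates on every append and consults before committing. A minimal sketch (class and method names are hypothetical, not the connector's API):

```java
// Hypothetical sketch: track bytes written to the current file so the
// writer can decide when to rotate without probing file sizes on HDFS.
class WrittenByteTracker {
    private final long rotateSize;   // configured limit, e.g. the HDFS block size
    private long bytesWritten = 0;

    WrittenByteTracker(long rotateSize) {
        this.rotateSize = rotateSize;
    }

    /** Record that n more bytes were written to the current file. */
    void add(long n) {
        bytesWritten += n;
    }

    /** True once the limit has been reached or exceeded -- time to rotate. */
    boolean shouldRotate() {
        return bytesWritten >= rotateSize;
    }

    /** Reset after the file is committed and a new one is opened. */
    void reset() {
        bytesWritten = 0;
    }

    long bytesWritten() {
        return bytesWritten;
    }
}
```

Note that whole records are never split: the check runs between appends, so a file may slightly exceed the limit by at most one record.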

@TomLous
Author

TomLous commented Sep 19, 2018

Thanks for the feedback.

I've already created an implementation that respects other filter criteria for my current client. I'm waiting for their legal team to get back to me, so I can create a PR.

@d33play

d33play commented Apr 25, 2019

Any progress on this feature?

@TomLous
Author

TomLous commented Apr 25, 2019

Sorry, not from my end. The OK was never given at eBay, and I'm no longer working there.
I'm also not allowed/able to share the progress we made on this feature, unfortunately.
I can say that it's pretty hard without a major rewrite, because the logic in the TopicPartitionWriter is based on, guess what, partitions instead of individual files. So the best we could implement was: if one file reached the max size, rotate all. It's not pretty, and in the end we moved away from Kafka Connect to Flink to have more fine-grained control over the HDFS files.

@GalDayan

Any progress on this feature guys?

@vipinkumar7

vipinkumar7 commented May 5, 2020

Here is what I implemented for SequenceFile:
https://github.com/confluentinc/kafka-connect-hdfs/compare/5.5.x...vipinkumar7:5.5.x?expand=1

It writes to the file in append mode until the file size limit is reached and only then does the final commit; until then it keeps writing to the same temp file.

Let me know your input, guys, so I can refactor it.
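The append-until-limit strategy described above can be sketched in-memory like this (the class and names are hypothetical; the linked branch works against SequenceFile on HDFS, with a `ByteArrayOutputStream` standing in for the temp file here):

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: append every record to one temp file; the final
// commit happens only once the size limit is reached, at which point the
// temp file's contents become a committed file and a fresh temp file opens.
class AppendUntilLimitWriter {
    private final long sizeLimit;
    private ByteArrayOutputStream tempFile = new ByteArrayOutputStream(); // stands in for the HDFS temp file
    final List<byte[]> committedFiles = new ArrayList<>();               // stands in for committed files

    AppendUntilLimitWriter(long sizeLimit) {
        this.sizeLimit = sizeLimit;
    }

    void write(byte[] record) {
        // Append mode: keep writing to the same temp file across records.
        tempFile.write(record, 0, record.length);
        if (tempFile.size() >= sizeLimit) {
            commit();
        }
    }

    private void commit() {
        committedFiles.add(tempFile.toByteArray());
        tempFile = new ByteArrayOutputStream();
    }
}
```

One caveat with this approach: data sitting in the temp file is not visible downstream until the size limit is hit, so a low-volume topic could go a long time without a commit unless a time-based fallback also triggers rotation.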

Usiel added a commit to Usiel/kafka-connect-hdfs that referenced this issue Nov 9, 2023
Being able to control the size of the written files is extremely helpful. This change gets us in the vicinity of such a feature.
A strict file-size guarantee would require a major rewrite, as the current logic is heavily based on partitions. The file-size-based rotation
is implemented such that we rotate all open files for a given `TopicPartitionWriter` instance when any file has reached
the configured limit.

See: confluentinc#365
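The rotate-all behavior from the commit message can be sketched as follows (names hypothetical): one writer tracks the running size of several open files, one per encoded output partition, and once any of them reaches the limit, all of them are rotated together.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a TopicPartitionWriter-like object may have several
// files open at once (one per encoded partition). Since the rotation logic
// is partition-scoped, any single file hitting the limit rotates them all.
class RotateAllWriter {
    private final long sizeLimit;
    private final Map<String, Long> openFileSizes = new HashMap<>();
    private int rotations = 0;

    RotateAllWriter(long sizeLimit) {
        this.sizeLimit = sizeLimit;
    }

    void append(String encodedPartition, long recordBytes) {
        openFileSizes.merge(encodedPartition, recordBytes, Long::sum);
        if (openFileSizes.values().stream().anyMatch(size -> size >= sizeLimit)) {
            rotateAll();
        }
    }

    private void rotateAll() {
        // Close and commit every open file, not just the one that hit the limit.
        openFileSizes.clear();
        rotations++;
    }

    int rotations() {
        return rotations;
    }
}
```

The trade-off is visible in the sketch: small files in other partitions get committed early whenever one partition's file fills up, which is why this is an approximation rather than a per-file size guarantee.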
@Usiel

Usiel commented Feb 2, 2024

#671 implements file size based rotation without any probing of file sizes on HDFS.
