Feature request: Rotation based on maximum file size on hdfs. #365
I think this would be a useful change, if handled appropriately. For example, a schema could change in the middle of a file (in any format, not just Avro or Parquet). For text or JSON that might not be a big deal, but it would be annoying if a process expected 8 CSV columns, for example, and only got 7. Also, time-based partitioning should not wait for the size limit to be reached before writing data into the "truncated" datetime path for a given partitioner. The "workarounds" are all the downstream processors I've enumerated in #271
@TomLous this feature would indeed be useful. It hasn't been rejected previously, as far as I know. @Cricket007 size-based rotation should not split records and should respect the other partitioning criteria.
Thanks for the feedback. I've already created an implementation for my current client that respects the other partitioning criteria. I'm waiting for their legal team to get back to me so I can open a PR.
Any progress on this feature?
Sorry, not from my end. The OK was never given at eBay, and I'm no longer working there.
Any progress on this feature, guys?
Here is what I implemented for SequenceFile: write to the file in append mode until the file size limit is reached, and only then do the final commit. Let me know what your input is, guys, and I will refactor it.
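For context, here is a minimal local-filesystem sketch of that approach (the class and file naming are illustrative assumptions, not the actual connector code): keep appending records to an open temp file, and only once the byte limit is reached do the "final commit" by moving the temp file into the output directory.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only: append until a size limit, then commit-by-move.
public class SizeLimitedAppender {
    private final Path tempFile;
    private final Path outputDir;
    private final long maxFileSizeBytes;
    private FileOutputStream out;
    private long bytesWritten;
    private int fileIndex;

    public SizeLimitedAppender(Path tempFile, Path outputDir, long maxFileSizeBytes)
            throws IOException {
        this.tempFile = tempFile;
        this.outputDir = outputDir;
        this.maxFileSizeBytes = maxFileSizeBytes;
        this.out = new FileOutputStream(tempFile.toFile(), true); // append mode
    }

    /** Append one record; commit and start a new file once the size limit is reached. */
    public void write(byte[] record) throws IOException {
        out.write(record);
        bytesWritten += record.length;
        if (bytesWritten >= maxFileSizeBytes) {
            commit();
        }
    }

    /** The "final commit": close the temp file and move it into the output directory. */
    private void commit() throws IOException {
        out.close();
        Files.move(tempFile, outputDir.resolve(String.format("part-%06d", fileIndex++)));
        out = new FileOutputStream(tempFile.toFile(), true); // fresh temp file
        bytesWritten = 0;
    }
}
```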
Being able to control the size of the written files is extremely helpful, and this change gets us in the vicinity of such a feature. A hard file size guarantee would require a major rewrite, as the current logic is heavily partition-based. File-size-based rotation is implemented in such a way that we rotate all open files for a given `TopicPartitionWriter` instance when any one file has reached the configured limit. See: confluentinc#365
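A minimal sketch of that rotation check, assuming the writer tracks the byte count of each open temp file per encoded partition (the names and structure here are assumptions, not the PR's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: rotate every open file for this writer as soon as any
// single file reaches the configured size limit, mirroring the behaviour
// described above.
public class RotateAllOnLimit {
    private final long fileSizeLimitBytes;
    // Bytes written per open temp file, keyed by encoded partition.
    private final Map<String, Long> openFileSizes = new HashMap<>();

    public RotateAllOnLimit(long fileSizeLimitBytes) {
        this.fileSizeLimitBytes = fileSizeLimitBytes;
    }

    public void recordWrite(String encodedPartition, long recordBytes) {
        openFileSizes.merge(encodedPartition, recordBytes, Long::sum);
        if (anyFileAtLimit()) {
            rotateAll();
        }
    }

    private boolean anyFileAtLimit() {
        return openFileSizes.values().stream()
                .anyMatch(size -> size >= fileSizeLimitBytes);
    }

    private void rotateAll() {
        // Commit and close every open file for this writer instance,
        // not just the one that hit the limit.
        openFileSizes.clear();
    }
}
```

Rotating all files at once keeps the commit logic aligned with the existing partition-based offsets, at the cost of some files being committed well under the limit.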
#671 implements file-size-based rotation without any probing of file sizes on HDFS.
I'd like the option to specify the maximum file size the HDFS connector writes before rotating.
I understand the only way to do this currently is to approximate it by setting `flushSize` (based on the number of records) or a time interval.
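For reference, that approximation looks roughly like the config below. The record-size estimate is purely illustrative; since `flush.size` counts records rather than bytes, the resulting file sizes only approach the target when record sizes are fairly uniform.

```properties
# Approximate size-based rotation today: commit after a fixed record count
# and/or a time interval. Assuming ~1 KB records, flush.size=131072 lands
# files near a 128 MB HDFS block, but only approximately.
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=logs
hdfs.url=hdfs://namenode:8020
flush.size=131072
rotate.interval.ms=3600000
```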
The reason is that it's very useful to keep files at approximately the HDFS block size, but no larger. This gives us more fine-grained control and the assurance that we won't end up with too-large files or, worse, many small files.
I'd like to add this feature to the codebase myself, if possible. Are there any restrictions or guidelines I have to take into account if I want the possibility of merging these changes back into the codebase?
Or was this feature previously explored and abandoned for some reason?