Small object files problem for multi schema for a single topic #537

Open
gokhansari opened this issue Jan 6, 2021 · 0 comments
I was trying to convert Avro files to Parquet and sink them to HDFS using the kafka-connect-hdfs connector. According to this Confluent blog post, it is possible to use multiple schemas for a single topic.
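For context, a minimal sink configuration along these lines might look like the following. This is a sketch: the connector name, topic, and URLs are placeholders, and the property names are from the kafka-connect-hdfs documentation as I understand it.

```json
{
  "name": "hdfs-parquet-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "topics": "my-topic",
    "hdfs.url": "hdfs://namenode:8020",
    "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
    "flush.size": "1000",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
```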

First I tried the schema references approach, which is the recommended one, but I hit a partitioning problem in Kafka Connect. After the Avro schema to Connect schema conversion, I noticed there is a wrapper struct model that the Kafka partitioner cannot parse dynamically for multiple schema types. I opened this issue on kafka-connect-storage-common.

Then I decided to try another approach: using TopicRecordNameStrategy to support multiple schemas. After a few tries I noticed that the rotation strategies do not work properly. Almost every file on HDFS contained only one or two messages, so there were a lot of small object files; something was breaking the rotation strategy.
Sadly, I then saw these lines in the documentation:

> Schema evolution only works if the records are generated with the default naming strategy, which is TopicNameStrategy. An error may occur if other naming strategies are used. This is because records are not compatible with each other. schema.compatibility should be set to NONE if other naming strategies are used. This may result in small object files because the sink connector creates a new file every time the schema ID changes between records.
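The rotation behavior described above can be sketched as follows. This is a simplified simulation (hypothetical records and schema IDs, not the connector's actual code) of a sink that starts a new output file whenever the schema ID changes between consecutive records:

```python
# Sketch of schema-change-driven file rotation: each record is a
# (schema_id, payload) pair, and a new file is started whenever the
# schema ID differs from the previous record's.

def partition_into_files(records):
    """Return the list of files (each a list of payloads) that would be
    written if a schema ID change forces a rotation. Other rotation
    triggers (flush.size, rotate.interval.ms) are omitted for brevity."""
    files = []
    current = []
    last_schema = None
    for schema_id, payload in records:
        if last_schema is not None and schema_id != last_schema:
            files.append(current)  # schema ID changed -> rotate
            current = []
        current.append(payload)
        last_schema = schema_id
    if current:
        files.append(current)
    return files

# With TopicRecordNameStrategy, record types can interleave on one topic,
# so schema IDs alternate and nearly every file holds one or two messages:
interleaved = [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (1, "e")]
print([len(f) for f in partition_into_files(interleaved)])  # -> [1, 1, 1, 1, 1]
```

With a single schema the same input would produce one large file, which is why the problem only surfaces once multiple record types share the topic.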

Having these small Parquet object files defeats the nature/purpose of the Parquet format, and it is not easy to process them once traffic is high enough.

In the end, I could not find any proper way to persist messages to HDFS via Kafka Connect when there are multiple schema types for a single topic.

Any suggestions or ideas? I would appreciate your answers.
