[Question] Recommended config for CSV without primary key #146

Open
scheung38 opened this issue Sep 14, 2020 · 6 comments

Comments

@scheung38

scheung38 commented Sep 14, 2020

What is the recommended source config for loading a CSV into Kafka when the rows have no unique primary keys? The data will be consumed by Postgres and other sinks.

@jcustenborder
Owner

@scheung38 Keys do not need to be unique. If two records have the same key, both will still be written to Kafka; they will just end up in the same partition.
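The same-partition behavior follows from key-based partitioning: Kafka's default partitioner hashes the serialized key bytes (murmur2) modulo the partition count. A simplified Python sketch of the idea, with `zlib.crc32` standing in for Kafka's actual hash:

```python
# Simplified sketch of key-based partitioning: records with equal keys
# always land in the same partition, but every record is still delivered.
# Kafka's default partitioner uses murmur2 over the serialized key bytes;
# zlib.crc32 stands in here for illustration only.
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Deterministic hash of the key, mapped onto the partition range.
    return zlib.crc32(key) % num_partitions

records = [(b"order-01", "row 1"), (b"order-01", "row 2"), (b"order-02", "row 3")]
for key, value in records:
    print(key, "-> partition", partition_for(key, num_partitions=6))
# Both "order-01" records map to the same partition; neither is dropped.
```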

@scheung38
Author

scheung38 commented Sep 14, 2020

For example:

[Screenshot 2020-09-15 at 00 08 14: sample CSV rows]

So given this scenario, is there a recommended config? If there are no keys, won't building a KSQL KTable later require a primary key? And without a unique key, how can each row be identified when updates arrive later?

@jcustenborder
Owner

I'm not sure about the KSQL part; I'd have to look into that. I don't think it supports compound keys, so you might need to use a single message transform to make just "order-01" your key. If you are looking for an aggregated view of order-01, you might need to use Kafka Streams instead and build the hierarchy there.
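For reference, promoting a value field to the record key can be done with the stock Kafka Connect transforms `ValueToKey` and `ExtractField`. A hedged sketch of the connector config; the column name `order_id` is an assumption, substitute whichever field holds values like "order-01":

```properties
# Hypothetical snippet: "order_id" is a placeholder for the real CSV column.
transforms=createKey,extractKey
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=order_id
transforms.extractKey.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractKey.field=order_id
```

`ValueToKey` copies the named field(s) into a struct key, and `ExtractField$Key` then unwraps it to a primitive key, which is usually what KSQL expects.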

@scheung38
Author

scheung38 commented Sep 14, 2020

Could we then hash each row on the fly to provide uniqueness? It might be better to write a UDF that hashes several columns, say C, D, E for hash1 and E, F, G for hash2.

A single field such as 'order-01' won't provide sufficient uniqueness; several fields are required.
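The multi-column hashing idea can be sketched in a few lines. This is a hypothetical illustration, not connector code; the column names C through G are placeholders for the real CSV headers:

```python
# Hypothetical sketch: derive a deterministic row key by hashing several
# columns together. Identical input rows always produce the same key.
import hashlib

def composite_key(row: dict, columns: list) -> str:
    # Join the selected column values with a separator unlikely to appear
    # in the data, then hash the result.
    joined = "\x1f".join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

row = {"C": "order-01", "D": 8, "E": 2020, "F": "widget", "G": 3}
hash1 = composite_key(row, ["C", "D", "E"])
hash2 = composite_key(row, ["E", "F", "G"])
# hash1 and hash2 differ because they cover different column sets.
```

Note that hashing only guarantees uniqueness to the extent the chosen columns are themselves distinct per row; two rows identical in C, D, E would still collide on hash1.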

@jcustenborder
Owner

Maybe. Since I don't fully understand this data, I can't say for sure. You could also build a single message transform that concatenates a few fields, e.g. order-01:8:2020. Personally, I would want to aggregate this into another topic keyed by order-1, with an array of the other content: something like key: order-1, value: [{},{},{}], where the values are the combination of those rows. That would give me all the data for an order in a single record.
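The suggested aggregation can be illustrated with plain Python; in production this would be a Kafka Streams `groupByKey`/`aggregate`, but a dict sketch shows the shape of the result. Field names here are hypothetical:

```python
# Illustrative sketch of the aggregation suggested above: group every row
# sharing an order id under one key, with the value being a list of the
# remaining fields. (A stand-in for Kafka Streams groupByKey/aggregate.)
from collections import defaultdict

rows = [
    {"order": "order-01", "line": 1, "sku": "A"},
    {"order": "order-01", "line": 2, "sku": "B"},
    {"order": "order-02", "line": 1, "sku": "C"},
]

aggregated = defaultdict(list)
for row in rows:
    key = row["order"]
    # Keep everything except the grouping field in the value array.
    aggregated[key].append({k: v for k, v in row.items() if k != "order"})

# aggregated["order-01"] now holds every line of that order in one record,
# i.e. key: order-01, value: [{...}, {...}].
```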

@scheung38
Author

scheung38 commented Sep 14, 2020

This is just a tiny sample; the real data volume will be large. Other than the first two columns, the remaining fields should differ enough to generate a unique hash as a primary key for each row. Why put it into a new topic? Could we not append the hash as a new column instead?
