[Question] Recommended config for CSV without primary key #146

Open
scheung38 opened this issue Sep 14, 2020 · 6 comments

Comments

@scheung38

scheung38 commented Sep 14, 2020

What is the recommended source config for loading a CSV into Kafka when the rows have no unique primary keys? The data will be consumed by Postgres and other sinks.

@jcustenborder
Owner

@scheung38 Keys do not need to be unique. If two records have the same key, both will still be written to Kafka; they will just end up in the same partition.
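The same-partition behavior follows from key-based partitioning: Kafka's default partitioner hashes the serialized key bytes (murmur2) modulo the partition count. A simplified Python sketch of the idea, with `zlib.crc32` standing in for Kafka's actual hash:

```python
# Simplified sketch of key-based partitioning: records with equal keys
# always land in the same partition, but every record is still delivered.
# Kafka's default partitioner uses murmur2 over the serialized key bytes;
# zlib.crc32 stands in here for illustration only.
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Deterministic hash of the key, mapped onto the partition range.
    return zlib.crc32(key) % num_partitions

records = [(b"order-01", "row 1"), (b"order-01", "row 2"), (b"order-02", "row 3")]
for key, value in records:
    print(key, "-> partition", partition_for(key, num_partitions=6))
# Both "order-01" records map to the same partition; neither is dropped.
```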

@scheung38
Author

scheung38 commented Sep 14, 2020

For example:

[Screenshot 2020-09-15 at 00 08 14: sample CSV rows]

So given this scenario, is there a recommended config? If there are no keys, won't building a KSQL KTable later require a primary key? And without a unique key, how can each row be identified when updates arrive later?

@jcustenborder
Owner

I'm not sure about the KSQL part; I'd have to look into that. I don't think it supports compound keys, so you might need to use a single message transform to make just "order-01" your key. If you are looking for an aggregated view of order-01, you might need to use Kafka Streams instead and build the hierarchy there.
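For reference, promoting a value field to the record key can be done with the stock Kafka Connect transforms `ValueToKey` and `ExtractField`. A hedged sketch of the connector config; the column name `order_id` is an assumption, substitute whichever field holds values like "order-01":

```properties
# Hypothetical snippet: "order_id" is a placeholder for the real CSV column.
transforms=createKey,extractKey
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=order_id
transforms.extractKey.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractKey.field=order_id
```

`ValueToKey` copies the named field(s) into a struct key, and `ExtractField$Key` then unwraps it to a primitive key, which is usually what KSQL expects.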

@scheung38
Author

scheung38 commented Sep 14, 2020

Could we then hash each row on the fly to provide uniqueness? It might be better to write a UDF that hashes several columns, say C, D, E for hash1 and E, F, G for hash2.

A single field such as 'order-01' won't provide sufficient uniqueness; several fields are required.
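The multi-column hashing idea can be sketched in a few lines. This is a hypothetical illustration, not connector code; the column names C through G are placeholders for the real CSV headers:

```python
# Hypothetical sketch: derive a deterministic row key by hashing several
# columns together. Identical input rows always produce the same key.
import hashlib

def composite_key(row: dict, columns: list) -> str:
    # Join the selected column values with a separator unlikely to appear
    # in the data, then hash the result.
    joined = "\x1f".join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

row = {"C": "order-01", "D": 8, "E": 2020, "F": "widget", "G": 3}
hash1 = composite_key(row, ["C", "D", "E"])
hash2 = composite_key(row, ["E", "F", "G"])
# hash1 and hash2 differ because they cover different column sets.
```

Note that hashing only guarantees uniqueness to the extent the chosen columns are themselves distinct per row; two rows identical in C, D, E would still collide on hash1.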

@jcustenborder
Owner

Maybe. Since I don't fully understand this data, I can't say for sure. You could also build a single message transform that concatenates a few fields, e.g. order-01:8:2020. Personally, I would want to aggregate this into another topic keyed by order-1, with an array of the other content: something like key: order-1, value: [{},{},{}], where the values are the combination of those rows. That would give me all the data for an order in a single record.
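The suggested aggregation can be illustrated with plain Python; in production this would be a Kafka Streams `groupByKey`/`aggregate`, but a dict sketch shows the shape of the result. Field names here are hypothetical:

```python
# Illustrative sketch of the aggregation suggested above: group every row
# sharing an order id under one key, with the value being a list of the
# remaining fields. (A stand-in for Kafka Streams groupByKey/aggregate.)
from collections import defaultdict

rows = [
    {"order": "order-01", "line": 1, "sku": "A"},
    {"order": "order-01", "line": 2, "sku": "B"},
    {"order": "order-02", "line": 1, "sku": "C"},
]

aggregated = defaultdict(list)
for row in rows:
    key = row["order"]
    # Keep everything except the grouping field in the value array.
    aggregated[key].append({k: v for k, v in row.items() if k != "order"})

# aggregated["order-01"] now holds every line of that order in one record,
# i.e. key: order-01, value: [{...}, {...}].
```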

@scheung38
Author

scheung38 commented Sep 14, 2020

This is just a tiny sample; the real data volume will be large. Other than the first two columns, the remaining fields should differ enough to generate a unique hash as a primary key for each row. Why put it into a new topic? Could we not append the hash as a new column instead?
