Skip to content
This repository was archived by the owner on Oct 5, 2023. It is now read-only.

Restore messages from s3#34

Merged
loffek merged 11 commits intos3from
TC-128/read-messages-from-s3-and-write-to-cluster
Nov 6, 2020
Merged

Restore messages from s3#34
loffek merged 11 commits intos3from
TC-128/read-messages-from-s3-and-write-to-cluster

Conversation

@MacRudDNA
Copy link

…t of records

Copy link

@loffek loffek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! But see my question around de-dup logic

@MacRudDNA MacRudDNA changed the title Implemented reading batch files from s3 and deserialize them into lis… Restore messages from s3 Nov 6, 2020
@MacRudDNA
Copy link
Author

MacRudDNA commented Nov 6, 2020

How it works.
Based on topics to restore allow and deny list we get list of TopicPartitionToRestore objects. For each there will be run one worker in thread pool with its own transactional producer.

Worker obtains list of json files with backup data and tries to send content of each batch in one transaction. After sending each message it's offset is saved in memory to prevent duplicates - before sending message we check if it's offset has already ben produced.
After committing transaction we read metadata of produced records to update new offset that each of the messages got.
So we end up with map of old-new offset for each restored message.

It is possible to force producer to use transaction for one produced record - to do this one needs to change RestoreMessageProducer.PRODUCER_BATCH_SIZE to 1.

Please note that we produce each record with timestamp we have from backup. Looking at documentation HERE I would argue that we can always give timestamp to produced message but for CreateTime type we can get some messages not being restored if restoring old messages. On the other hand if message.timestamp.difference.max.ms is same as retention.ms it should make no difference for us because older messages would also disappear from source cluster.

@MacRudDNA MacRudDNA marked this pull request as ready for review November 6, 2020 12:31
Copy link

@loffek loffek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👨‍🍳

@loffek
Copy link

loffek commented Nov 6, 2020

Please note that we produce each record with timestamp we have from backup. Looking at documentation HERE I would argue that we can always give timestamp to produced message but for CreateTime type we can get some messages not being restored if restoring old messages. On the other hand if message.timestamp.difference.max.ms is same as retention.ms it should make no difference for us because older messages would also disappear from source cluster.

💯

@loffek loffek merged commit 3cc4ce6 into s3 Nov 6, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants