kvs/content: checkpoint should not overwrite current checkpointed reference #6629

Open
chu11 opened this issue Feb 12, 2025 · 1 comment

chu11 commented Feb 12, 2025

Currently the checkpoint is saved to the same location, overwriting whatever was already there.

If something goes wrong with a final checkpoint, there is no way to recover from a prior checkpoint.

There should be a rolling history of checkpoints, so that an earlier one is still available if the last checkpoint failed.

There is some complexity here: not only do we need to support the rolling history, but any mechanism that recovers from a checkpoint may also need to "go back in history" to an earlier one.
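
To make the "go back in history" part concrete, here is a minimal sketch of what a recovery path could do with a rolling history. Everything in it is an assumption for illustration: the `recover_rootref` helper, the dict layout with `sequence` and `rootref` keys, and the `is_valid` callback are hypothetical and not existing flux-core code.

```python
from typing import Callable, Iterable, Optional


def recover_rootref(
    checkpoints: Iterable[dict],
    is_valid: Callable[[dict], bool],
) -> Optional[dict]:
    """Return the newest checkpoint that passes validation, or None.

    `checkpoints` must be ordered newest first.
    """
    for ckpt in checkpoints:
        if is_valid(ckpt):
            return ckpt
    return None


# Example: the newest checkpoint is damaged, so recovery falls back by one.
history = [
    {"sequence": 3, "rootref": "sha1-cccc"},
    {"sequence": 2, "rootref": "sha1-bbbb"},
    {"sequence": 1, "rootref": "sha1-aaaa"},
]
damaged = {"sha1-cccc"}  # hypothetical set of known-bad references
chosen = recover_rootref(history, lambda c: c["rootref"] not in damaged)
```

What counts as "valid" is left open here; it could be as simple as the root blob being loadable, or as thorough as a full walk of the tree.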

garlick commented Feb 13, 2025

Maybe for a start, something like this could be employed to keep N checkpoints in the database at all times:

https://stackoverflow.com/questions/1977341/rolling-rows-in-sql-table
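
As a rough sketch of the rolling-rows idea, the following uses Python's `sqlite3` module to insert a new checkpoint and prune everything but the newest N in a single transaction. The table name `checkpt_history`, its columns, and the helper functions are made up for illustration and are not the actual content backend schema.

```python
import sqlite3

MAX_CHECKPOINTS = 3  # the "N" in the discussion; value chosen for illustration


def init_db(conn):
    # Hypothetical table: a monotonically increasing id plus the checkpoint value.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpt_history ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " value TEXT NOT NULL)"
    )


def save_checkpoint(conn, value, keep=MAX_CHECKPOINTS):
    # Insert and prune in one transaction so that a failure midway
    # rolls back and leaves the previous set of checkpoints untouched.
    with conn:
        conn.execute("INSERT INTO checkpt_history (value) VALUES (?)", (value,))
        conn.execute(
            "DELETE FROM checkpt_history WHERE id NOT IN ("
            " SELECT id FROM checkpt_history ORDER BY id DESC LIMIT ?)",
            (keep,),
        )


def list_checkpoints(conn):
    # Newest first.
    rows = conn.execute("SELECT value FROM checkpt_history ORDER BY id DESC")
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
init_db(conn)
for i in range(1, 6):
    save_checkpoint(conn, f"rootref-{i}")
latest = list_checkpoints(conn)
```

After five saves with N = 3, only the three newest rows remain. Because the new row is added rather than written over an old one, the previous checkpoints are never modified by a save.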

One thought is to add a flux kvscheck command that is run in rc1 before the KVS module is loaded. It could walk the hash tree from the current checkpoint and, if it finds any missing data, abort, causing the instance to abort as well. We could add options to fix the KVS offline, for example by rolling back to a previous checkpoint or perhaps taking other measures. See also

But perhaps we could keep this issue focused on simply storing the latest N checkpoints in the database safely.
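
For reference, the offline check mentioned above (walk the hash tree from a checkpoint and report missing data) might look roughly like this. It is only a sketch under simplifying assumptions: the content store is modeled as an in-memory dict, and the node format with a `children` list is hypothetical, not the real KVS tree object format.

```python
import hashlib
import json


def blobref(data: bytes) -> str:
    # Content-addressed reference: hash of the blob itself.
    return "sha1-" + hashlib.sha1(data).hexdigest()


def store(content: dict, obj: dict) -> str:
    data = json.dumps(obj, sort_keys=True).encode()
    ref = blobref(data)
    content[ref] = data
    return ref


def find_missing(content: dict, rootref: str) -> list:
    """Walk every blob reachable from rootref; return refs absent from content."""
    missing, seen, stack = [], set(), [rootref]
    while stack:
        ref = stack.pop()
        if ref in seen:
            continue
        seen.add(ref)
        data = content.get(ref)
        if data is None:
            missing.append(ref)
            continue
        obj = json.loads(data)
        # Hypothetical node format: {"children": [ref, ...]}; leaves have none.
        stack.extend(obj.get("children", []))
    return missing


content = {}
leaf_a = store(content, {"value": "a"})
leaf_b = store(content, {"value": "b"})
root = store(content, {"children": [leaf_a, leaf_b]})
ok = find_missing(content, root)      # nothing missing yet
del content[leaf_b]                   # simulate a lost blob
broken = find_missing(content, root)  # now reports leaf_b
```

A non-empty result would be the signal to abort, or to try the same walk from an older checkpoint.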
