Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database BackupManager #1711

Open
wants to merge 47 commits into
base: nightly
Choose a base branch
from
Open

Database BackupManager #1711

wants to merge 47 commits into from

Conversation

jfldde
Copy link
Contributor

@jfldde jfldde commented Jan 15, 2025

Description

  • Introduces new BackupManager to backup all databases required by a citrea node.
  • Introduces two new RPCs:
    • backup_create <optional_path> to create a node backup. Defaults to using path from config if None. Errors if both config backup path and path are not set.
    • backup_validate <path> to validate a node backup at path
    • backup_info <path> to get info about backup at path
  • Adds new --restore-db <path> arg to restore a node from a backup at path.

Linked Issues

Depends on #1714

Pending on newly introduce citrea-cli in #1808 to extract and move restore-db as a cli command

@jfldde jfldde requested a review from eyusufatik as a code owner January 15, 2025 13:32
@auto-assign auto-assign bot requested a review from kpp January 15, 2025 13:32
@jfldde jfldde marked this pull request as draft January 15, 2025 13:32
@jfldde jfldde requested a review from Chinacolt January 15, 2025 13:37
Copy link

codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 24.54545% with 332 lines in your changes missing coverage. Please review.

Project coverage is 76.8%. Comparing base (36f0e6c) to head (907b155).

Files with missing lines Patch % Lines
crates/common/src/backup/manager.rs 21.1% 164 Missing ⚠️
crates/common/src/backup/utils.rs 0.0% 70 Missing ⚠️
crates/common/src/backup/rpc.rs 20.4% 66 Missing ⚠️
...overeign-sdk/full-node/db/sov-schema-db/src/lib.rs 0.0% 11 Missing ⚠️
bin/citrea/src/cli.rs 0.0% 7 Missing ⚠️
crates/light-client-prover/src/da_block_handler.rs 0.0% 5 Missing ⚠️
crates/light-client-prover/src/services.rs 0.0% 3 Missing ⚠️
...es/sovereign-sdk/full-node/db/sov-db/src/mmr_db.rs 0.0% 3 Missing ⚠️
bin/citrea/src/main.rs 0.0% 2 Missing ⚠️
bin/citrea/src/rollup/mod.rs 93.7% 1 Missing ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
bin/citrea/src/rollup/bitcoin.rs 0.0% <ø> (ø)
bin/citrea/src/rollup/mock.rs 76.3% <100.0%> (+0.7%) ⬆️
crates/batch-prover/src/da_block_handler.rs 75.0% <100.0%> (+0.3%) ⬆️
crates/batch-prover/src/lib.rs 100.0% <100.0%> (ø)
crates/batch-prover/src/runner.rs 84.8% <100.0%> (+0.7%) ⬆️
crates/common/src/config.rs 94.8% <100.0%> (+<0.1%) ⬆️
crates/fullnode/src/da_block_handler.rs 72.9% <100.0%> (+0.2%) ⬆️
crates/fullnode/src/lib.rs 100.0% <100.0%> (ø)
crates/fullnode/src/runner.rs 90.3% <100.0%> (+0.1%) ⬆️
crates/sequencer/src/lib.rs 100.0% <100.0%> (ø)
... and 15 more

... and 2 files with indirect coverage changes

@jfldde jfldde marked this pull request as ready for review January 20, 2025 10:59
@auto-assign auto-assign bot requested a review from rakanalh January 20, 2025 10:59
bin/citrea/src/rollup/mod.rs Outdated Show resolved Hide resolved
@jfldde jfldde added the HOLD-MERGE PR is not draft but should not be merged yet label Jan 30, 2025
@jfldde
Copy link
Contributor Author

jfldde commented Jan 30, 2025

Holding merge until devops' feedback

crates/common/src/backup/manager.rs Outdated Show resolved Hide resolved
crates/common/src/backup/manager.rs Show resolved Hide resolved
crates/common/src/backup/rpc.rs Show resolved Hide resolved
crates/sovereign-sdk/full-node/db/sov-db/Cargo.toml Outdated Show resolved Hide resolved
Copy link

@Chinacolt Chinacolt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backup & Restore Observations and Questions

Backup Creation (backup_create)

  • Why do we pass the id field?
  • I tested whether it blocks ongoing synchronization and RPC operations. It does not. ✅

Backup Information (backup_info)

  • Regardless of which id we pass, it always returns all backup information.
  • If id is ignored, what is the purpose of passing it?

Backup Validation (backup_validate)

  • If we pass a non-existent path, it fails. Otherwise, it always returns true.
  • The id field doesn’t seem to have any effect—tested with id: -73, and it still returned true.

Restore (restore)

  • I tested restore several times with --restore-db "/mnt/task/backup", and it worked consistently. ✅
  • Question: Can we specify which id to restore? It seems to always restore the latest version.

Other Observations (Not necessarily issues, just for awareness)

1. Version Tracking Behavior

  • Deleted the latest (6th) backup and then restored. It reverted to the 5th version, which became the latest.
  • After restoring, creating a new backup resulted in version 7.
  • It seems the system tracks previously created versions even after a restore. Is this expected behavior?

2. Partial Backup Deletion & Restore Behavior

  • Deleted the last 3 versions from native-db, last 2 versions from state, and the last one from ledger.
  • After restore, each database seemed to restore to its latest available backup.
  • However, once the restore was completed, an error occurred:
    "error": "Timestamp should be greater"

Suggestion:

  • We, ops, can validate if all DBs are at the same version before restoring to avoid inconsistency.
  • This validation doesn’t necessarily have to be handled inside Citrea’s binary, but we need a way to validate individual versions by passing the id parameter to backup_validate.

3. Post-Restore Backup Behavior

After restore completion, a new directory citrea-db.bak-{EPOCH} appears.
Question:

  • Does this directory contain a backup of the newly restored version?
  • Or is it the previous citrea-db backup just before it was replaced by citrea-db.tmp-{EPOCH}?

P.S: All tests are tried on Dev-Net Full-Node EU

@jfldde
Copy link
Contributor Author

jfldde commented Feb 6, 2025

Backup & Restore Observations and Questions

Backup Creation (backup_create)

* Why do we pass the `id` field?

* I tested whether it blocks ongoing synchronization and RPC operations. It does not. ✅

Backup Information (backup_info)

* Regardless of which `id` we pass, it always returns all backup information.

* If `id` is ignored, what is the purpose of passing it?

Backup Validation (backup_validate)

* If we pass a non-existent path, it fails. Otherwise, it always returns `true`.

* The `id` field doesn’t seem to have any effect—tested with `id: -73`, and it still returned `true`.

I think there's a confusion on the API. The three routes expect only a backup path (optional for backup_create if set in config). jsonrpsee typing is a bit loose here and just ignore extra parameters.
id is internal to rocksdb backup mechanism and is used to restore at previous backup iteration.

Restore (restore)

* I tested restore several times with `--restore-db "/mnt/task/backup"`, and it worked consistently. ✅

* **Question:** Can we specify which `id` to restore? It seems to always restore the latest version.

I'll amend so that we can pass an optional id to restore to a previous backup. At the root of the backup dir, you can find a mapping between L2 height and backup id.

Other Observations (Not necessarily issues, just for awareness)

1. Version Tracking Behavior

* Deleted the latest (6th) backup and then restored. It reverted to the 5th version, which became the latest.

* After restoring, creating a new backup resulted in version 7.

* It seems the system tracks previously created versions even after a restore. Is this expected behavior?

This is probably due to inconsistency between you manually deleting 6th backup and rocksdb backup engine pulling the id from some remaining parts. Just tested with removing the latest on every subdir and it succesfully used the right id

2. Partial Backup Deletion & Restore Behavior

* Deleted the last 3 versions from `native-db`, last 2 versions from `state`, and the last one from `ledger`.

* After restore, each database seemed to restore to its latest available backup.

* However, once the restore was completed, an error occurred:
  `"error": "Timestamp should be greater"`

Suggestion:

* We, ops, can validate if all DBs are at the same version before restoring to avoid inconsistency.

* This validation doesn’t necessarily have to be handled inside Citrea’s binary, but we need a way to validate individual versions by passing the `id` parameter to `backup_validate`.

The thing is, the backup is made out of 3 different rocksdb backup, one state, one native, one ledger.
If you are to delete - let's say some version of native-db - but don't touch the other ones, they are still valid backups themselves, just not in sync between them anymore. I'll make sure to assert that all backups are at the same expected id on validate.
Assert that all subdirs have the same number of incremental backups

3. Post-Restore Backup Behavior

After restore completion, a new directory citrea-db.bak-{EPOCH} appears. Question:

* Does this directory contain a backup of the newly restored version?

* Or is it the previous `citrea-db` backup just before it was replaced by `citrea-db.tmp-{EPOCH}`?

This is a backup of the previous citrea-db

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
HOLD-MERGE PR is not draft but should not be merged yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RocksDB backups through RPC
5 participants