-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle cases when backup worker pulling may miss mutations #11908
Conversation
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
CI failure is TraceTooManyLines error fixed in #11935 |
fdbserver/BackupWorker.actor.cpp
Outdated
try { | ||
tr.setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS); | ||
tr.setOption(FDBTransactionOptions::PRIORITY_SYSTEM_IMMEDIATE); | ||
tr.setOption(FDBTransactionOptions::LOCK_AWARE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is doing READ_LOCK_AWARE and READ_SYSTEM_KEYS better here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
I.e., throw an error to trigger a recovery.
This is needed so that CC knows the lower bound of versions that can be included in a backup.
Because in NOOP mode, backup workers still writes to the database, and cause non-empty storage queues.
Otherwise, the pop version can become larger than the actual saved version when switching to the regular pulling mode. Because the pop version is larger, mutations larger than saved version could be popped and no long available.
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-ide on Linux CentOS 7
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
CI failures are unrelated to this change. |
When TLog replies with a popped version, it means the peek request is missing mutations from the peek's begin version to the popped version. This was not handled in backup worker previously. (#10990 has a different fix of removing the noop mode.)
This PR addresses this by checking if the missing versions are caused by NOOP mode pops. This can only happen when the worker requests mutations in the previous epoch. If so, the worker reads the max popped version due to noop mode, which should be >= tlog reply's popped version. If not, an assertion is triggered, because mutations are missing.
In the NOOP mode, we save the popped version in the system key space "backupWorkerMaxNoopVersionKey". The transaction that saves noop version can cause QuietDatabase workload failure, so I disabled it before the consistency check.
100k BackupCorrectnessPartitioned.toml 20250210-050111-jzhou-37ef394139f3b154
100k regular correctness 20250217-175836-jzhou-bd7a8310904ebc19 , 20250207-033912-jzhou-e83d1a7d767205be
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branch
ormain
if this is the youngest branch)