Skip to content

FDB Debugging Techniques and Debug Samples

Jingyu Zhou edited this page Feb 25, 2022 · 6 revisions

Debugging Techniques

Look for Severity 40 Events

E.g., cmr | s40. This is assuming you have installed some bash functions from here.

Keep in mind that time is deterministic

One very basic but very powerful property of the simulator is that time is simulated as well, and it is deterministic. This allows us to do a lot of interesting things:

  • We can change the behavior after a certain point in time. For example in order to verify that a bug was fixed we could keep the old behavior right until the bug would happen and then switch to the new behavior.
  • We can add code to make a test crash after some time has passed (this is useful for timeout tests where we don't want to wait on the test to finish)
  • We can add additional tracing just during a critical time period.
  • ...

debugTransaction

This is useful to track how a transaction is processed from client, i.e., NativeAPI, to server processes (e.g., proxy, tlog, and storage servers).

By calling tr.debugTransaction(UID debugID), both reads and commits of the transaction will be logged. Just searching for the debugID in the log, and note this ID is attached to other IDs during the processing.

Note: adding this can change random seed and you may not reproduce the problem.

DEBUG_MUTATION

This is useful to tracking mutations of a specific version or a specific key (range). For example, in the data corruption example below, this technique is used to confirm a key (and a backup mutation) is modified (committed).

To enable this, change MUTATION_TRACKING_ENABLED to 1 and in MutationTracking.h and modify MutationTracking.cpp with the version or key (range). Then look for MutationTracking in trace events.

debugRandom

When debugging, we can use debugRandom() just like deterministicRandom() to generate random numbers, which would not change the random sequence of the existing deterministicRandom().

Add More Trace Events

Adding trace events if better than std::cout, for additional information of simulation time and role. The trace log is also easier for repeated search/grep operations.

fdbdecode for Parsing Backup Mutation Logs

Sometimes we need to know mutations in the backup, then fdbdecode becomes handy.

The following is an example of using this tool. Note the mutation log starts at version MinLogBeginVersion and ends at MaxLogEndVersion. If the target mutation is within this range, then the tool should find it.

./cbuild_output/bin/fdbdecode -r file:///root/simfdb/backups/backup-1970-01-01-00-01-48.793374/ 
Params: ContainerURL: file:///root/simfdb/backups/backup-1970-01-01-00-01-48.793374/, FileFilter: , list_only: false

URL: file:///root/simfdb/backups/backup-1970-01-01-00-01-48.793374/
Restorable: true
Partitioned logs: false
Snapshot:  startVersion=299866656 (maxLogEnd -0.00 days)  endVersion=301988136 (maxLogEnd -0.00 days)  totalBytes=10647884  restorable=true  expiredPct=0.00
SnapshotBytes: 10647884
MinLogBeginVersion:      298765213 (maxLogEnd -0.00 days)
ContiguousLogEndVersion: 338765213 (maxLogEnd -0.00 days)
MaxLogEndVersion:        338765213 (maxLogEnd -0.00 days)
MinRestorableVersion:    301988136 (maxLogEnd -0.00 days)
MaxRestorableVersion:    338765212 (maxLogEnd -0.00 days)

Relevant files are:  2 log files
beginVersion:298765213 endVersion:318765213 blockSize:1048576 filename:logs/0000/0000/log,298765213,318765213,e0f3e48993e59f5d28b677c352640d30,1048576 fileSize:24159023 tagId: (None)
beginVersion:318765213 endVersion:338765213 blockSize:1048576 filename:logs/0000/0000/log,318765213,338765213,c0b471ad226e5ff4f95a85cd04aa823e,1048576 fileSize:2441829 tagId: (None)

304663277 code: ClearRange param1: 0000000000j param2: 0000000000nrhvkfdmjstonitcacfetcxklqxzjdxmwlqkvcoyyruvbzrvjnshatxzbu
... (elided)

Note each mutations is printed with their commit versions, followed by the actual mutation data.

Unseed Mismatch

See this doc for debugging nondeterminism issues.

Gocha's

  • BinaryReader returned objects share the Arena created insider the BinaryReader object, which means the lifetime of the returned objects can't exceed the BinaryReader object. Be careful about memory corruption issues if the BinaryReader object goes out of scope.

  • ASAN/TSAN/UBSAN tests may succeed on one compiler but fail on another.

  • Valgrind build. Enable cmake option USE_VALGRIND=ON.

  • Set _GLIBCXX_DEBUG to debug STL related issue.

Sample Debug Sessions

Sample 1: Backup failure (data corruption) in VV

  • Branch: version-vector-prototype
  • Commit: 44fe89333
  • Seed: -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ApiCorrectnessAtomicRestore.toml -s 600630827 -b off

The test complains data corruption:

DB Range:
7: 0000000000lfp 713
8: 0000000000lfvnnyyhhbpwdgoackhmfmytftucedqhgaccfeczajfuvpkaqihjuwuieifpnwyitsstypdoiixxoilcxyzyrzjsofkelnnyjhavlnoxjewbmdwwlhvq 980

Memory Range:
7: 0000000000lfn 164
8: 0000000000lfp 713

So apparently key 0000000000lfn is missing from DB. Enabled mutation tracking for this key and found 5 clears at versions: 303674707 304036118 304488242 304663277 307827805

1 set at version: 314985785.

All these are within the backup files' range, but using fdbdecode didn't find the last set mutation.

Then I confirmed proxy added backup mutations with key \xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x00 to \xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x04. I also confirmed these backup mutations are persisted at storage servers with customized mutation tracking of the above keys.

Backup mutation is at:

125.538085 BackupM ID=0000000000000000 Key=\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x00
125.538085 BackupM ID=0000000000000000 Key=\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x01
125.538085 BackupM ID=0000000000000000 Key=\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x02
125.538085 BackupM ID=0000000000000000 Key=\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x03
125.538085 BackupM ID=0000000000000000 Key=\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x04

These mutations appear in TLogCommitMessages that have tags 0:0 0:9 Then at MutationTracking ID=a2b4b46f9d9a406c Label=BackupM At=fetchKeysFinalCommitInject Version=315120917 Tag=0:9

SS applied the mutation:

Time="125.542238" DateTime="2021-10-22T03:50:50Z" Type="MutationTracking" Machine="2.0.1.3:1" ID="a4b37511d365cb0e" Label="BackupM" At="applyMutation" Version="314985785" Mutation="code: SetValue param1: \xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6M9\x00\x00\x00\x00 param2: \x01\x00\x01b\xb0
\x00..." ShardBegin="\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&]\x00\x00\x00\x00\x12J\xf8H\x00\x00\x00\x0f" ShardEnd="\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x8d\x00\x00\x00\x00\x12b|v\x00\x00\x00\x16"

ID=a4b37511d365cb0e Label=BackupM At=applyMutation Version=314985785

Backup process BackupLogRangeTaskFunc::_execute, which divides the range into smaller ones, each calls readCommitted(). Read range:

\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xb7\x42\x80
\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6\x84\xc0

Version 314000000 to 315000000

Key is missing from range read above: \xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xc6\x4d\x39\x00\x00\x00\x00

The last key read is smaller than the missing key:

Time="129.662813" DateTime="2021-10-22T03:50:52Z" Type="TransactionDebug" ID="0000000000000001" Location="storageserver.getKeyValues.AfterReadRange"
...
129.664422 RCD ID=0000000000000000 Key=\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xbfp\x8e\x00\x00\x00\x1b Idx=5
Time="129.664422" DateTime="2021-10-22T03:50:52Z" Type="TransactionDebug" ID="0000000000000001" Location="NativeAPI.getRange.After"

SS reads at 314536078 < 314985785 -- the applied version < 318803044

ReqSS="[{0:0,314536078} maxversion: 314536078] calculates: CommitVersion="314536078"
Time="129.483760" DateTime="2021-10-22T05:34:56Z" Type="GetRangeInfo" Machine="2.0.1.3:1" ID="a4b37511d365cb0e" Begin="\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xb7B\x80" End="\xff\x02/blog/S{\xf3\xa6a5v\xa3VB\xd8Wn\xf8f&\x80\x00\x00\x00\x00\x12\xba\xd6" Version="314536078" ReqVersion="318803044" More="1" SizeOf="9"

The problem is apparent, the range read is not returning the correct data. After adding some trace events, the problem was identified at getKeyValuesQ() at SS, specifically these two lines:

		Version commitVersion = getLatestCommitVersion(req.ssLatestCommitVersions, data->tag);
		state Version version = wait(waitForVersion(data, commitVersion, req.version, span.context));

In the test, req.ssLatestCommitVersions="[{0:0,314536078} maxversion: 314536078] CommitVersion="314536078", and req.version=318803044 So the SS reads at 314536078 < 314985785 ( the applied mutation version at the SS) < 318803044 (read version). As a result, the mutation is not sent back to backup agent, causing mutations being missing from backup files.

The reason of the wrong VV is because proxy doesn't count backup mutations. See the fix is in PR 5830.

Sample 2: VV data corruption with CausalConsistencyError

  • commit: b4ed47053
  • patch: init( PROXY_USE_RESOLVER_PRIVATE_MUTATIONS, true );
  • Seed: -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/SwizzledRollbackSideband.toml -s 243572583 -b on
  • Compiler: Clang

Test failed with: Type="CausalConsistencyError" Machine="3.4.3.8:1" ID="10cb551589353523" MessageKey="Sideband/Message/c9037706b28b3e04" RemoteCommitVersion="841905696" LocalReadVersion="946662688"

Enable mutation tracking on key Sideband/Message/c9037706b28b3e04, and saw last writes this key at version 841905696:

315.681248 MutationTracking ID=2c1851ccbb311b41 Label=SomeKey At=applyMutation Version=841905696 Mutation=code: SetValue param1: Sideband/Message/c9037706b28b3e04 param2: deadbeef ShardBegin= ShardEnd=Sideband/Message/e
315.681248 MutationTracking ID=0f20d996da4c284b Label=SomeKey At=ServerPeekCursor Version=841905696 Mutation=code: SetValue param1: Sideband/Message/c9037706b28b3e04 param2: deadbeef MessageTags=-2:3 0:1

However, the storage server saw VV using version 841885664:

325.140962 MutationTracking ID=2c1851ccbb311b41 Label=SomeKey At=ShardGetValue Version=841885664 Mutation=code: DebugKey param1: Sideband/Message/c9037706b28b3e04 param2: <null>
325.140962 MutationTracking ID=2c1851ccbb311b41 Label=SomeKey At=ShardGetPath Version=841885664 Mutation=code: DebugKey param1: Sideband/Message/c9037706b28b3e04 param2: 2
325.140962 MutationTracking ID=2c1851ccbb311b41 ReqVersion=946662688 VV=[{0:1,841885664} maxversion: 841885664] CommitVersion=841885664

Thus, VV is not update to date. Further investigation showed there is a recovery between these events:

309.759683 2.0.1.1:1 ID=3c39387664f7a73b 12 all_logs_recruited ThreadID=8063736322934559140 Roles=CP,DD,GP,LR,MS,RK,SS,TL
319.510407 2.0.1.1:1 ID=da60ac7999cc737f 0 reading_coordinated_state ThreadID=8063736322934559140 Roles=BK,CP,DD,GP,LR,MS,RK,SS,TL
319.517734 2.0.1.1:1 ID=da60ac7999cc737f 1 locking_coordinated_state TLogs=1 ActiveGenerations=2 MyRecoveryCount=16 ForceRecovery=0 ThreadID=8063736322934559140 Roles=BK,CP,DD,GP,LR,MS,RK,SS,TL
320.166834 2.0.1.1:1 ID=da60ac7999cc737f 3 reading_transaction_system_state ThreadID=8063736322934559140 Roles=BK,CP,DD,LR,MS,RK,SS,TL
320.171453 2.0.1.1:1 ID=da60ac7999cc737f 7 recruiting_transaction_servers Conf={"backup_worker_enabled":1,"commit_proxies":5,"grv_proxies":1,"log_routers":6,"log_spill":2,"log_version":6,"logs":6,"perpetual_storage_wiggle":0,"perpetual_storage_wiggle_locality":"data_hall:0","proxies":6,"redundancy_mode":"single","regions":[{"datacenters":[{"id":"0","priority":2},{"id":"2","priority":1,"satellite":1},{"id":"4","priority":1,"satellite":1}]},{"datacenters":[{"id":"1","priority":1},{"id":"3","priority":1,"satellite":1},{"id":"5","priority":1,"satellite":1}]}],"remote_redu... RequiredCommitProxies=1 RequiredGrvProxies=1 RequiredResolvers=1 ThreadID=8063736322934559140 Roles=BK,CP,DD,LR,MS,RK,SS,TL
320.176199 2.0.1.1:1 ID=da60ac7999cc737f 8 initializing_transaction_servers CommitProxies=2 GrvProxies=1 TLogs=2 Resolvers=1 SatelliteTLogs=0 OldLogRouters=1 StorageServers=0 BackupWorkers=3 PrimaryDcIds=0 RemoteDcIds=1 ThreadID=8063736322934559140 Roles=DD,MS,RK,SS,TL
320.198125 2.0.1.1:1 ID=da60ac7999cc737f 9 recovery_transaction PrimaryLocality=-1 DcId=0 ThreadID=8063736322934559140 Roles=CP,DD,GP,LR,MS,RK,SS,TL
320.225664 2.0.1.1:1 ID=da60ac7999cc737f 10 writing_coordinated_state TLogList=0: 43fd2181d0ada4f366e1e610ea0e2194, 036201ec527e05078cf14be39319f12b ThreadID=8063736322934559140 Roles=CP,DD,GP,LR,MS,RK,SS,TL
324.128473 2.0.1.1:1 ID=da60ac7999cc737f 11 accepting_commits StoreType=memory-radixtree-beta RecoveryDuration=4.61807 ThreadID=8063736322934559140 Roles=CP,DD,GP,LR,MS,RK,SS,TL
324.128473 2.0.1.1:1 ID=da60ac7999cc737f 12 all_logs_recruited ThreadID=8063736322934559140 Roles=CP,DD,GP,LR,MS,RK,SS,TL
333.554110 2.0.1.1:1 ID=da60ac7999cc737f 14 fully_recovered FullyRecoveredAtVersion=955256309 ThreadID=8063736322934559140 Roles=BK,CP,DD,GP,LR,MS,RK,SS,TL

Added more instrumentation and what happened seems to be: A client has a VV cache of [{0:1,841885664} maxversion: 841885664]; client issues a transaction commits at 841905696. There is a recovery. The client got Delta=[ maxversion: -1] back. Client issues a read at 946662688, which uses the cached version 841885664 < 841905696, thus reading corrupted data. The client side code didn't know there is a recovery, thus kept using the cache from older master, which caused this problem. The fix is to reset client side cache once a new master is used.

Sample 3: VV failure DeliveredToNotAssigned assertion failure at SS

  • commit: a8ba9bd02
  • patch: init( PROXY_USE_RESOLVER_PRIVATE_MUTATIONS, true );
  • seed: -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ApiCorrectness.toml -s 809470518 -b on
  • Failure: DeliveredToNotAssigned ID=0000000000000000 Version=184396025 Mutation=code: SetValue param1: \xff/moveKeysLock/Write param2: \x15\x1d\xc0%\x06\xb2\xa3)\x96[{*\x8b\xd5\x17\x1c
  • DB configuration is single: new backup_worker_enabled:=1 log_routers:=4 log_spill:=2 log_version:=6 perpetual_storage_wiggle:=0 single remote_triple memory-2 storage_migration_type=disabled tss_count:=1 usable_regions:=2 tss_storage_engine:=1 regions=[{"datacenters":[{"id":"0","priority":2},{"id":"2","priority":1,"satellite":1,"satellite_logs":5}],"satellite_redundancy_mode":"one_satellite_double"},{"datacenters":[{"id":"1","priority":1},{"id":"3","priority":1,"satellite":1}],"satellite_redundancy_mode":"one_satellite_double"}]

Failed SS ID=eccf72642c631fe9 and Tag="0:0".

The failure DeliveredToNotAssigned means the SS received a mutation that it is not responsible for. So we need to look at shard ranges on the SS and see how they are moved. The shard movement is through MoveKey operations. Since the failed key is \xff/moveKeysLock/Write, naturally we'll enable mutation tracking for this key.

By going through log with more instrumentation on source and destination of MoveKey operations, we can find that there is a shard movement before:

Time="117.703077" DateTime="2021-11-10T18:34:37Z" Type="RelocateShard_FinishMoveKeysWaitDurable" Machine="2.0.1.0:1" ID="9da64b1d90b87835" EndPair="3aac5b0622bfae4e" ReadyServers="2" ReadyTSS="1" SRCs="36486523b7b50a46,b0ecc74478ea7330,c201743d86431db5,eccf72642c631fe9" DESTs="36486523b7b50a46,c201743d86431db5"

This is the final transaction, i.e., FinishMoveKey. At Resolver, 0:0 no longer responsible for range [0000000002zu, \xff\xff) and the range is moved to 0:1.

117.720628 SendingPrivatized_Span ID=8fe8fb3abf0743cd Span=0000000000000000
117.720628 SendingPrivateMutation ID=8fe8fb3abf0743cd Original=code: SetValue param1: \xff/serverKeys/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13/0000000002zu param2:  Privatized=code: SetValue param1: \xff\xff/serverKeys/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13/0000000002zu param2:  Server=eccf72642c631fe9 TagKey=\xff/serverTag/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13 Tag=0:0
117.720628 SendingPrivatized_Span ID=8fe8fb3abf0743cd Span=0000000000000000
117.720628 SendingPrivateMutation ID=8fe8fb3abf0743cd Original=code: SetValue param1: \xff/serverKeys/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13/\xff\xff param2:  Privatized=code: SetValue param1: \xff\xff/serverKeys/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13/\xff\xff param2:  Server=eccf72642c631fe9 TagKey=\xff/serverTag/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13 Tag=0:0

More instrumentation confirmed the transaction committed: 117.921585 MKIRDone ID=c5bcfd554ae74b5a ReadVersion=181974393 CommitVersion=182968702 Range=Begin:0000000002zuEnd:\\xff\\xff and SS applied this mutation: 119.844811 SSPrivateMutation ID=eccf72642c631fe9 Mutation=code: SetValue param1: \xff\xff/serverKeys/\xe9\x1fc,dr\xcf\xec?3\\8\xab\x82\xe6\x13/0000000002zu param2: Version=182968702.

Shard has been moved to DESTs 0:1,1:1. And next MoveKey lock write is sent to SS 0:1, so CP tags the mutation correctly:

"117.722621" DateTime="2021-11-10T18:34:37Z" Type="MutationTracking" Machine="2.0.1.2:1" ID="84ae49626c5ed171" Label="MoveKey" At="ProxyCommit" Version="182968702" Mutation="code: SetValue param1: \xff/moveKeysLock/Write param2: \\\xf0\x05t\xd23\x0c\xbb\xb3\xfe'\xd3\x97\xad\x15g" To="0:1,1:1"

Another move key starts for range [0000000001z, 0000000002zu):

119.089702 RelocateShardHasDestination ID=22e7af88890c7882 PairId=3e16733cbbb3e507 KeyBegin=0000000001z KeyEnd=0000000002zu SourceServers=b0ecc74478ea7330,eccf72642c631fe9 DestinationTeam=36486523b7b50a46,c201743d86431db5 ExtraIds=[no items]
119.135619 MKIRStartMK ID=3e16733cbbb3e507 ReadVersion=182968702 VV=[{0:0,182454221}{1:1,182968702}{1:0,182454221}{0:1,182968702}{1:2,182368775} maxversion: 182968702] Range=Begin:0000000001zEnd:0000000002zu EndKey=0000000002zu

Look closely, we found that the VV has {0:0,182454221}, which has wrong version, 182454221 != 182968702, because the previous move has written to 0:0 at 182968702.

Now we have a breakthrough! The problem is the version vector: we instrumented master at version 182968702 and confirmed that tag 0:0 is not in the written tags sent from the commit proxy. So this is the problem. By examining code, we noticed the private mutation generated on the Resolver, but their tags are not sent back to commit proxies, thus their tags are not updated. The fix is in PR 5960.