-
Notifications
You must be signed in to change notification settings - Fork 0
7.1 Bugs
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/IncrementTest.toml -s 905681640 -b on
tenants_disabled error in first one
PR 7102 fixes this WriteDuringReadClean
failure
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/AtomicBackupCorrectness.toml -s 569974852 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ConfigureStorageMigrationTest.toml -s 700203409 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/MoveKeysSideband.toml -s 923441921 -b on
411.919147 ConfigureDatabase_WrongStoreType ID=0000000000000000 SuppressedEventCount=0 ServerID=15171f53419b0da8 ProcessID=f88dfc9d951c08d08 71bddfb3fc4a3ec ServerStoreType=ssd-1 ConfigStoreType=memory TS
369.625483 SSTag ID=15171f53419b0da8 MyTag=0:3 SS 369.814814 ServerInfo ID=7647ccfd84beb6ee ServerInfoIndex=0 ServerID=15171f53419b0da89a9a396caecfaee3 ServerTeamOwned=1 MachineID=b3285876b51e1494ea6bbd30b99ce1a7 StoreType=ssd-1 InDesiredDC=1 BK,DD,GP,MS,RK,SS,TL 369.814814 ServerStatus ID=7647ccfd84beb6ee ServerID=15171f53419b0da8 Healthy=1 MachineIsValid=1 MachineTeamSize=1 BK,DD,GP,MS,RK,SS,TL 369.814814 ServerTeamInfo ID=7647ccfd84beb6ee TeamIndex=3 Healthy=1 TeamSize=1 MemberIDs=15171f53419b0da89a9a396caecfaee3 TeamID=29a39df4d9fa738b BK,DD,GP,MS,RK,SS,TL 369.814814 MachineInfo ID=7647ccfd84beb6ee MachineInfoIndex=2 Healthy=1 MachineID=b3285876b51e1494ea6bbd30b99ce1a7 MachineTeamOwned=1 ServerNumOnMachine=1 ServersID=15171f53419b0da89a9a396caecfaee3 BK,DD,GP,MS,RK,SS,TL 369.814814 MachineLocalityMap ID=7647ccfd84beb6ee LocalityIndex=2 UID=15171f53419b0da89a9a396caecfaee3 LocalityRecord=KeyValueArraySize:5 KeyValueArrayIndex:0 Key:0 Value:0 KeyValueArrayIndex:1 Key:1 Value:0 KeyValueArrayIndex:2 Key:2 Value:7 KeyValueArrayIndex:3 Key:3 Value:8 KeyValueArrayIndex:4 Key:4 Value:9 BK,DD,GP,MS,RK,SS,TL
Issuing configuration change: ssd-1 Issuing configuration change: ssd-1 Issuing configuration change: backup_worker_enabled:=1 Issuing configuration change: storage_migration_type=aggressive Issuing configuration change: log_version:=6 Issuing configuration change: backup_worker_enabled:=1 Issuing configuration change: backup_worker_enabled:=0 Issuing configuration change: memory-radixtree-beta Issuing configuration change: log_version:=4 Issuing configuration change: perpetual_storage_wiggle=1 storage_migration_type=gradual Issuing configuration change: storage_migration_type=aggressive Issuing configuration change: log_version:=3 Issuing configuration change: log_version:=5 Issuing configuration change: backup_worker_enabled:=1 Issuing configuration change: memory Issuing configuration change: perpetual_storage_wiggle=1 Issuing configuration change: ssd-1 Issuing configuration change: log_spill:=1 Issuing configuration change: perpetual_storage_wiggle=0 storage_migration_type=aggressive Issuing configuration change: perpetual_storage_wiggle=1 storage_migration_type=gradual perpetual_storage_wiggle_locality=zoneid:cf3ae9e1138c53ae7d02e04a29ab84a9 Issuing configuration change: memory-2 Issuing configuration change: backup_worker_enabled:=0 Issuing configuration change: perpetual_storage_wiggle=1 perpetual_storage_wiggle_locality=processid:f88dfc9d951c08d0871bddfb3fc4a3ec DestroyDatabaseWorkload;RandomClogging;RandomClogging complete CloggedConfigureDatabaseTest complete
-f ./foundationdb/tests/fast/FuzzApiCorrectnessClean.toml -s 851021642 -b off
commit: 3f1955c8f, 20220429-162316-jzhou-5e37b8b1b12a976b
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/AtomicBackupCorrectness.toml -s 359817600 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/LowLatencyWithFailures.toml -s 736426219 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/rare/ConfigIncrementWithKills.toml -s 269771905 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/DataLossRecovery.toml -s 944886588 -b on
-f ./foundationdb/tests/fast/AtomicBackupCorrectness.toml -s 359817600 -b on Assertion newOldestVersion <= latestVersion failed @ /root/src/foundationdb/fdbclient/VersionedMap.h 687: newOldestVersion: 6029252 , latestVersion 1
SS 1e23cf7e86d61103, Tag=1:1 Begin=1 TLog ID=9c3efdbc30ba6aa6 LogId=8fea9f639336489a
7.562937 ClusterRecoveryRecruitedInitialStorageServers Machine=[abcd::2:0:1:0]:1 ID=1ebe53f0e31135dd TargetCount=3 Servers=d43d55f85ea323fa,3faa95919bb122f1,1e23cf7e86d61103 CC,CD 12.674167 ClusterRecoveryRecruitedInitialStorageServers Machine=[abcd::2:0:1:0]:1 ID=fa8a1263afcf46cf TargetCount=3 Servers=9ebeb0ec4d73b590,9017c4173c59562a,388525c3ebdfbdc2
SS e2ab1c6e52b9f1d5 Tag=1:0 TLog ID=7c14e355627b4bd6 LogId=0dd4dae63d12deae Recovery Txn Version = 274963058
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/DataLossRecovery.toml -s 944886588 -b on
697.491891 LoadBalanceTooLong ID=0000000000000000 SuppressedEventCount=9 Duration=70.1921 NumAttempts=25 Backoff=5 TriedAllOptions=1 RK,SS 697.491891 LoadBalanceTooLongEndpoint ID=0000000000000000 Addr=2.0.2.1:2 Token=a86e9362ce89e0b1 Failed=0 RK,SS 697.491891 LoadBalanceTooLongEndpoint ID=0000000000000000 Addr=2.1.1.14:1 Token=a42efc8e51c524a5 Failed=0 RK,SS 697.491891 LoadBalanceTooLongEndpoint ID=0000000000000000 Addr=2.2.1.13:1 Token=7086b4bd3dea23bd Failed=0 RK,SS
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/DiskFailureCycle.toml -s 716066061 -b off
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/SystemRebootTestCycle.toml -s 683832471 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/rare/ConfigIncrementWithKills.toml -s 685413079 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/Unreadable.toml -s 146449410 -b on
SS 6d91e987c266b4e2 Tag=1:3 TLog ID=663a4f33b022d716 LogId=bb9259001529bb4b Recovery Txn Version=166788707
78.593985 Role ID=bb9259001529bb4b As=TLog Transition=Begin Origination=Recovered OnWorker=7e7b3e9e8a4f898b SharedTLog=663a4f33b022d716 78.594385 TLogRecover ID=663a4f33b022d716 LogId=bb9259001529bb4b At=66788707 Known=63030713 Unrecovered=63030714 Tags=[no items] Locality=1 LogRouterTags=0 78.594385 CommitQueueNewLog ID=663a4f33b022d716 LogId=bb9259001529bb4b Version=63030713 Committing=0 Commmitted=63030713 CD,LR,SS,TL 78.594385 TLogReady ID=bb9259001529bb4b AllTags=-1:2,1:0,1:1,1:2,1:3,-7:0,-7:1,-7:2 Locality=1 CD,LR,SS,TL 78.673767 TLogPeekMessages4 ID=663a4f33b022d716 LogId=bb9259001529bb4b Tag=1:3 ReqBegin=66788708 EndVer=166788707 MsgBytes=0 CD,LR,SS,TL
Assertion rollbackVersion >= data->storageVersion() failed
Recovery txn version: 166464321 SS cb7c3497b7179fed, Tag 1:2, Src TLog LogID 381e01d30d003f23
70.505393 SSTag ID=cb7c3497b7179fed MyTag=1:2 LR,SS,TL 76.824832 TLogPeekSingleNoHistory ID=cb7c3497b7179fed Tag=1:2 Begin=66464322 LR,SS,TL 76.824832 TLogPeekLocalBestOnly ID=cb7c3497b7179fed Tag=1:2 Begin=66464322 End=9223372036854775807 BestSet=2 BestSetStart=65867347 LogId=381e01d30d003f23 LR,SS,TL 77.486995 StorageServerSourceTLogID ID=cb7c3497b7179fed SourceTLogID=381e01d30d003f23a0704a6dd7a62461 LR,SS,TL
77.490765 SSPrivateMutation ID=cb7c3497b7179fed Mutation=code: SetValue param1: \xff\xff/globals/lastEpochEnd param2: A*\xf6\x03\x00\x00\x00\x00 Version=166464321 LR,SS,TL 77.490765 CodeCoverage ID=0000000000000000 File=/root/src/foundationdb/fdbserver/storageserver.actor.cpp Line=6369 Condition=true LR,SS,TL 77.490765 Rollback ID=cb7c3497b7179fed FromVersion=166464320 ToVersion=66464321 AtVersion=166464321 RestoredVersion=56775485 StorageVersion=161464320 LR,SS,TL
TLog: ID=c2c875f43e9ac00d LogId=381e01d30d003f23 76.766383 TLogRecover ID=c2c875f43e9ac00d LogId=381e01d30d003f23 At=66464321 Known=65867346 Unrecovered=65867347 Tags=[no items] Locality=1 LogRouterTags=0 LR,SS,TL 76.766383 CommitQueueNewLog ID=c2c875f43e9ac00d LogId=381e01d30d003f23 Version=65867346 Committing=0 Commmitted=65867346 LR,SS,TL 76.766383 TLogReady ID=381e01d30d003f23 AllTags=-1:2,1:0,1:1,1:2,-7:0,-7:1,-7:2 Locality=1 LR,SS,TL
76.790218 ClusterRecovery ID=e64b4f4a5ef2bba7 EndPair=08fd0d6da34d1201 RecoveryTransactionVersion=166464321 CC,CD,GP,SS,TL We could use CC to worker broadcast (i.e., ServerDBInfo) to piggyback RecoveryTransactionVersion to all TLogs, SSes, so that they can use the information to disallow or skip peeking partially [recoveredAt, RecoveryTransactionVersion).
77.244430 TLogPeekLocalBestOnly ID=cb7c3497b7179fed Tag=1:2 Begin=66464322 End=9223372036854775807 BestSet=2 BestSetStart=65867347 LogId=381 e01d30d003f23 LR,SS,TL 77.244430 SPC_StreamCreated ID=17c6d80a3452f443 PeerAddr=2.1.1.2:1 PeerToken=79f1da6102acb845 LR,SS,TL 77.244430 LogRouterPeekStream ID=fb12a7a07be9fbf5 Token=2459970830edfbcf LR,SS,TL 77.244430 LogRouterPeekStream ID=fb12a7a07be9fbf5 Token=2459970830edfbcf LR,SS,TL
A remote SS 8479c8eb4802fc0a peeks a remote TLog:
- Recovery brought SS to RecoveredAt=169381041, SS: Version= Tag=1:2 DurableVersion=
- SS peeks TLog
- TLog returns empty message with version 269381040, SS updates storage version to 264381040.
- SS peeks TLog again, got Version=269381041 with
\xff\xff/globals/lastEpochEnd
, tries to roll back to 67023141 < 162023140 and fails the assertionrollbackVersion >= data->storageVersion()
.
98.440118 MutationTracking ID=8479c8eb4802fc0a Label=Everything At=SSPeek Version=269381041 Mutation=code: SetValue param1: \xff\xff/globals/lastEpochEnd param2: \xb1\x8c\x18\x0a\x00\x00\x00\x00 LR,SS,TL 98.440118 SSPrivateMutation ID=8479c8eb4802fc0a Mutation=code: SetValue param1: \xff\xff/globals/lastEpochEnd param2: \xb1\x8c\x18\x0a\x00\x00\x00\x00 Version=269381041 LR,SS,TL 98.440118 CodeCoverage ID=0000000000000000 File=/root/src/foundationdb/fdbserver/storageserver.actor.cpp Line=6369 Condition=true LR,SS,TL 98.440118 Rollback ID=8479c8eb4802fc0a FromVersion=269381040 ToVersion=169381041 AtVersion=269381041 RestoredVersion=163401198 StorageVersion=264381040 LR,SS,TL
Remote TLog peek log router, with recovery txn 269381041: 97.111529 LogRouterPeek4 ID=eebb71edb1d14482 Tag=-3:2 ReqBegin=168810293 End=269381042 MessageSize=592 PoppedVersion=168774853 LR,SS,TL 97.111529 LogRouterPeek4 ID=eebb71edb1d14482 Tag=-3:1 ReqBegin=168810293 End=269381042 MessageSize=592 PoppedVersion=168774853 LR,SS,TL 97.111529 LogRouterPeek4 ID=eebb71edb1d14482 Tag=-3:0 ReqBegin=168810293 End=269381042 MessageSize=592 PoppedVersion=168774853 LR,SS,TL
However, this log router c344736aaa387dc8 return empty messages [168810292, 269381041) 97.102681 LogRouterPeek4 ID=c344736aaa387dc8 Tag=-3:2 ReqBegin=168810292 End=168810293 MessageSize=0 PoppedVersion=0 LR,SS,TL 97.102681 LogRouterPeek4 ID=c344736aaa387dc8 Tag=-3:1 ReqBegin=168810292 End=168810293 MessageSize=0 PoppedVersion=0 LR,SS,TL 97.102681 LogRouterPeek4 ID=c344736aaa387dc8 Tag=-3:0 ReqBegin=168810292 End=168810293 MessageSize=0 PoppedVersion=0 LR,SS,TL 97.111502 LogRouterPeek4 ID=c344736aaa387dc8 Tag=-3:2 ReqBegin=168810293 End=269381041 MessageSize=0 PoppedVersion=168810291 LR,SS,TL 97.111502 LogRouterPeek4 ID=c344736aaa387dc8 Tag=-3:1 ReqBegin=168810293 End=269381041 MessageSize=0 PoppedVersion=168810291 LR,SS,TL 97.111502 LogRouterPeek4 ID=c344736aaa387dc8 Tag=-3:0 ReqBegin=168810293 End=269381041 MessageSize=0 PoppedVersion=168810291 LR,SS,TL
Check this router: peeking primary TLOG 5becadc11d56d3dc, LogID 03bd3e7b13ee04aa (pullAsyncData() to get recovery versions up to recoveredAt)
97.073696 LogRouterStart ID=c344736aaa387dc8 Start=168810292 Tag=-2:0 Localities=3 Locality=1 LR,SS,TL
97.073696 SPC_Starting ID=8c6d7b77819e5b03 Tag=-2:0 UsePeekStream=1 Begin=168810292 End=9223372036854775807 LR,SS,TL
97.095662 LogRouterPeekStream ID=c344736aaa387dc8 Token=05f01930f2abf5db LR,SS,TL
97.099487 LogRouterPeekLocation ID=c344736aaa387dc8 LogID=03bd3e7b13ee04aa LR,SS,TL
97.099487 LogRouterPeekStream ID=c344736aaa387dc8 Token=05f01930f2abf5db LR,SS,TL
97.099487 TLogPeekStream ID=1751e36c85af1a47 Token=5d06f932ba27a9f7 LR,SS,TL
RecoveredAt version 169381041: 97.119618 MutationTracking ID=8dda02c21e4d4acc Label=Everything At=TLogCommitMessages Version=169381041 Mutation=code: SetValue param1: \xff/conf/perpetual_storage_wiggle param2: 0 MessageTags=-2:3 0:0 0:1 0:2 1:0 1:1 1:2 1:3 DebugID=d906fc33b8fa5acc LR,SS,TL
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/BackupCorrectnessClean.toml -s 987736578 -b on
A remote SS peeks a remote TLog:
- Recovery brought SS to RecoveredAt=67023141, SS: Version=67023141 Tag=1:2 DurableVersion=62023141
- SS peeks TLog
- TLog returns empty message with version 167023140, SS updates storage version to 162023140.
- SS peeks TLog again, got Version=167023141 with
\xff\xff/globals/lastEpochEnd
, tries to roll back to 67023141 < 162023140 and fails the assertionrollbackVersion >= data->storageVersion()
.
It seems after recovery, the first peek reply must get the recovery txn to avoid this problem. Indeed, Evan confirmed that "the tlog is not supposed to be able to return an empty version in the range between the RecoveredAt and the recovery version".
SS 2f8972c4c6c9436e, tag 1:2
Rollback assertion failure rollbackVersion >= data->storageVersion()
FromVersion=167023140 ToVersion=67023141 AtVersion=167023141 RestoredVersion=59705336 StorageVersion=162023140
81.171772 SSPrivateMutation Machine=[abcd::2:1:1:2]:1 ID=2f8972c4c6c9436e Mutation=code: SetValue param1: \xff\xff/globals/lastEpochEnd param2: %\xb1\xfe\x03\x00\x00\x00\x00 Version=167023141
This is from TLog: logId is a05740bb2004c026, dbgid f80f51a0acfe6ff5, Machine=[abcd::2:1:1:0]:1
Remote tlog calls pullAsyncData
from log routers, and log version changed at 167023140, with empty messages.
81.145295 TLogInfo3 Machine=[abcd::2:1:1:0]:1 ID=f80f51a0acfe6ff5 Log=a05740bb2004c026 Ver=167023140 LR,SS,TL
commitMessages():
81.170295 MutationTracking Machine=[abcd::2:1:1:0]:1 ID=a05740bb2004c026 Label=DDD At=TLogCommitMessages Version=167023141 Mutation=code: SetValue param1: \xff\xff/globals/lastEpochEnd param2: %\xb1\xfe\x03\x00\x00\x00\x00 MessageTags=-2:0 -1:2 0:0 0:1 0:2 0:3 1:0 1:1 1:2 1:3 DebugID=f80f51a0acfe6ff5
Check log routers, they return data with recovery txn with end version 167023142: 81.139853 LogRouterPeek4 Machine=[abcd::2:1:1:2]:1 ID=7698a2db1906dd27 Tag=-3:0 ReqBegin=67023142 End=167023142 MessageSize=283 PoppedVersio n=65180705 LR,SS,TL 81.139853 LogRouterPeek4 Machine=[abcd::2:1:1:2]:1 ID=7698a2db1906dd27 Tag=-3:1 ReqBegin=67023142 End=167023142 MessageSize=283 PoppedVersio n=65180705 LR,SS,TL 81.139853 LogRouterPeek4 Machine=[abcd::2:1:1:2]:1 ID=7698a2db1906dd27 Tag=-3:2 ReqBegin=67023142 End=167023142 MessageSize=283 PoppedVersio n=65180705 LR,SS,TL
However, TLog is waking up on version 167023140 with no data. PEEK_USING_STREAMING
is on. Turning it off seems to fix the crash, but Joshua finds other instances of the same failure. So it's not related to PEEK_USING_STREAMING
knob.
BufferedCursor::hasMessage
Both SS and TLog are in remote region.
78.228182 MasterRecoveredConfig Machine=[abcd::2:0:1:0]:1 ID=c75fd64bc723e243 LastEpochEnd=67023141 RecoveryTransactionVersion=167023141 78.228182 MasterRecoveredConfig Machine=[abcd::2:0:1:0]:1 ID=c75fd64bc723e243 Conf={backup_worker_enabled:0,blob_granules_enabled:0,commit_proxies:1,grv_proxies:2,log_routers:6,log_spill:2,log_version:6,perpetual_storage_wiggle:0,perpetual_storage_wiggle_locality:0,proxies:3,redundancy_mode:double,regions:[{datacenters:[{id:0,priority:2},{id:1,priority:1,satellite:1},{id:2,priority:1,satellite:1}]},{datacenters:[{id:1,priority:-1},{id:0,priority:1,satellite:1,satellite_logs:3},{id:2,priority:1,satellite:1,satellite_logs:6}]}],storage_engine:memory-2,storage_migration_type:disabled,tenant_mode:disabled,usable_regions:2}
passed SS: 81.149727 StorageUpdater Machine=[abcd::2:1:1:2]:1 ID=2f8972c4c6c9436e From=67023141 Restored=59705336 LR,SS,TL 81.149727 SSPrivateMutation Machine=[abcd::2:1:1:2]:1 ID=2f8972c4c6c9436e Mutation=code: SetValue param1: \xff\xff/globals/lastEpochEnd param2: %\xb1\xfe\x03\x00\x00\x00\x00 Version=167023141 LR,SS,TL 81.149727 RollbackSkip Machine=[abcd::2:1:1:2]:1 ID=2f8972c4c6c9436e FromVersion=67023141 ToVersion=67023141 AtVersion=167023141 RestoredVersion=59705336 StorageVersion=62023141
Failed SS: higher FromVersion, AtVersion and StorageVersion
81.147066 StorageUpdater Machine=[abcd::2:1:1:2]:1 ID=2f8972c4c6c9436e From=67023141 Restored=59705336 LR,SS,TL
81.147066 StorageServerSourceTLogID Machine=[abcd::2:1:1:2]:1 ID=2f8972c4c6c9436e SourceTLogID=a05740bb2004c02645709cabf3b1a9b6 LR,SS,TL
81.158723 FetchKeys Machine=[abcd::2:1:1:2]:1 ID=2f8972c4c6c9436e EndPair=928be7cca1c7805d LR,SS,TL
81.171772 StorageUpdater Machine=[abcd::2:1:1:2]:1 ID=2f8972c4c6c9436e From=167023140 Restored=59705336 LR,SS,TL 81.171772 SSPrivateMutation Machine=[abcd::2:1:1:2]:1 ID=2f8972c4c6c9436e Mutation=code: SetValue param1: \xff\xff/globals/lastEpochEnd param2: %\xb1\xfe\x03\x00\x00\x00\x00 Version=167023141 LR,SS,TL 81.171772 Rollback Machine=[abcd::2:1:1:2]:1 ID=2f8972c4c6c9436e FromVersion=167023140 ToVersion=67023141 AtVersion=167023141 RestoredVersion=59705336 StorageVersion=162023140
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/KillRegionCycle.toml -s 740096849 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/DataLossRecovery.toml -s 116137620 -b on
StorageServersRecruiting=1
AttemptedDoubleRecruitment ID=3db8ef5a4231786f ForRole=StorageServer
repeats forever (cmr | egrep 'StorageServerRecruitment|AttemptedDoubleRecruitment' | sl
).
So DD keeps trying to recruit a worker that already has an SS. Why DD doesn't know that an SS is running?
15.157044 Role ID=fdc1b79fe811eeb5 As=StorageServer Transition=Begin Origination=Recruited OnWorker=3db8ef5a4231786f IsTSS=0 StorageEngine=memory-radixtree-beta CD,SS 85.220829 Role ID=3ec6881e81cf175c As=StorageServer Transition=Begin Origination=Recruited OnWorker=3db8ef5a4231786f IsTSS=0 StorageEngine=memory-radixtree-beta CC,CD,CP,RV,SS,TL 97.246740 Role ID=4d4ec8f9e559696d As=StorageServer Transition=Begin Origination=Recruited OnWorker=3db8ef5a4231786f IsTSS=0 StorageEngine=memory-radixtree-beta CD,SS,TL
99.620345 SSTag ID=4d4ec8f9e559696d MyTag=1:0 CD,SS,TL
Check SS 4d4ec8f9e559696d, the addStorageServer()
txn committed at version 655685601.
97.397673 ServerTag ID=28115c1a5d973405 Server=4d4ec8f9e559696d Tag=1:0 CC,CP,RV,SS,TL
97.425861 AddedStorageServerSystemKey ID=0000000000000000 ServerID=4d4ec8f9e559696d CD,SS,TL
97.440010 InitTenantMap ID=4d4ec8f9e559696d Version=655685601 NumTenants=1 CD,SS,TL
97.440010 InsertTenant ID=4d4ec8f9e559696d Tenant=SimulatedDefaultTenant Version=655685601 CD,SS,TL
97.440010 KVSMemSwitchingToLargeTransactionMode ID=4d4ec8f9e559696d TransactionSize=78 DataSize=0 CD,SS,TL
97.440010 FullSnapshotEnd ID=4d4ec8f9e559696d PreviousSnapshotEndLoc=0.4196 SnapshotSize=319 SnapshotElements=7 CD,SS,TL
97.453257 StorageServerInit ID=4d4ec8f9e559696d Version=655685600 SeedTag=-1:0 TssPair= CD,SS,TL
97.453257 StorageServerRestoreDurableState ID=4d4ec8f9e559696d RestoredBytes=0 CD,SS,TL
97.453257 TLogPeekSingleNoHistory ID=4d4ec8f9e559696d Tag=1:0 Begin=655685601 CD,SS,TL
There is a recovery around that time:
97.204921 2.0.1.0:1 ID=5af0c85ea53be4f9 12 all_logs_recruited
98.051503 2.0.1.0:1 ID=4469a6de272e059f 0 reading_coordinated_state
Check if version 655685601 is committed or rolled back, which durable.
99.603445 MutationTracking ID=d9b0a5c52853755e Label=DDDD At=makeVersionDurable Version=655685601 Mutation=code: SetValue param1: \xff/serverTag/miY\xe5\xf9\xc8NM3\x84!@\x02Rh< param2: \x01\x00\x01b\xb0\x00\xdb\x0f\x01\x00\x00 SS
Check force recovery, which rolls back version to 553507006 < 655685601. The problem is that the storage server is not detecting that it is supposed to be dead, normally the peak of a tag that does not exist should result in an error, but it seems like in this case that is not happening.
100.659911 ForceRecovery_Begin ID=0000000000000000 TS
102.798896 ForcedRecoveryStart ID=43f2ca56d87e46df ClusterControllerDcId=1 DcId=1 CC,SS,TL
102.852328 MasterRecoveryState ID=2849670632279b9b StatusCode=1 Status=locking_coordinated_state TLogs=2 ActiveGenerations=7 MyRecoveryCount=22 ForceRecovery=0 CC,SS,TL
107.826180 MasterRecoveryState ID=24806395dace6b03 StatusCode=1 Status=locking_coordinated_state TLogs=2 ActiveGenerations=7 MyRecoveryCount=23 ForceRecovery=1 CC,SS,TL
107.826180 ForcedRecovery ID=24806395dace6b03 PrimaryLocality=0 RemoteLocality=1 FoundRemote=1 Modified=5 Removed=2 CC,SS,TL
107.826180 ForcedRecoveryTLogs ID=24806395dace6b03 I=0 Log=feee0613971272dd Loc=1 Txs=2 CC,SS,TL
107.826180 ForcedRecoveryTLogs ID=24806395dace6b03 I=0 J=0 Log=3e0dd9e06d96b747,5dafa0c4b049081d Loc=1 Txs=2 CC,SS,TL
107.826180 ForcedRecoveryTLogs ID=24806395dace6b03 I=1 J=0 Log=f7457e14247a2a90,5c31ab4fcf48bad9 Loc=1 Txs=2 CC,SS,TL
107.826180 ForcedRecoveryTLogs ID=24806395dace6b03 I=2 J=0 Log=a03d348476c73a7f,752cc270a377b4b5 Loc=1 Txs=2 CC,SS,TL
107.826180 ForcedRecoveryTLogs ID=24806395dace6b03 I=3 J=0 Log=c9b4431307a241ac,28a6cdf2ff751c3e Loc=1 Txs=2 CC,SS,TL
107.826180 HigherRecoveryVersion ID=24806395dace6b03 Idx=0 Ver=553607005 CC,SS,TL
107.829979 FinalRecoveryVersionInfo ID=0000000000000000 KCV=553507005 MinEnd=553607005 CC,SS,TL
107.856502 ForcedRecoveryConfigChange ID=24806395dace6b03 Conf={backup_worker_enabled:0,blob_granules_enabled:0,commit_proxies:1,grv_proxies:2,log_spill:2,perpetual_storage_wiggle:1,perpetual_storage_wiggle_locality:data_hall:1,proxies:3,redundancy_mode:single,regions:[{datacenters:[{id:1,priority:1},{id:3,priority:1,satellite:1,satellite_logs:5}],satellite_redundancy_mode:one_satellite_double},{datacenters:[{id:0,priority:-1},{id:2,priority:1,satellite:1}],satellite_redundancy_mode:one_satellite_double}],remote_logs:3,remote_redundancy_mode:remote_single,storage_engine:memory-radixtree-beta,storage_migration_type:disabled,tenant_mode:required_experimental,usable_regions:1} CC,SS,TL
107.924497 BWToCore ID=0000000000000000 Epoch=21 TotalTags=0 BeginVersion=452349458 EndVersion=553507006 CC,CP,RV,SS,TL
107.924497 BWToCore ID=0000000000000000 Epoch=14 TotalTags=0 BeginVersion=351995786 EndVersion=452349458 CC,CP,RV,SS,TL
107.924497 BWToCore ID=0000000000000000 Epoch=10 TotalTags=0 BeginVersion=251388182 EndVersion=351995786 CC,CP,RV,SS,TL
107.924497 BWToCore ID=0000000000000000 Epoch=8 TotalTags=0 BeginVersion=150126374 EndVersion=251388182 CC,CP,RV,SS,TL
107.924497 BWToCore ID=0000000000000000 Epoch=5 TotalTags=0 BeginVersion=31806742 EndVersion=150126374 CC,CP,RV,SS,TL
107.961837 BWToCore ID=0000000000000000 Epoch=21 TotalTags=0 BeginVersion=452349458 EndVersion=553507006 CC,CP,RV,SS,TL
107.961837 BWToCore ID=0000000000000000 Epoch=14 TotalTags=0 BeginVersion=351995786 EndVersion=452349458 CC,CP,RV,SS,TL
107.961837 BWToCore ID=0000000000000000 Epoch=10 TotalTags=0 BeginVersion=251388182 EndVersion=351995786 CC,CP,RV,SS,TL
107.961837 BWToCore ID=0000000000000000 Epoch=8 TotalTags=0 BeginVersion=150126374 EndVersion=251388182 CC,CP,RV,SS,TL
107.961837 BWToCore ID=0000000000000000 Epoch=5 TotalTags=0 BeginVersion=31806742 EndVersion=150126374 CC,CP,RV,SS,TL
107.991475 ForcedRecoveryFinish ID=43f2ca56d87e46df CC,CP,RV,SS,TL
Now check why SS peek doesn't cause it to exit.
TLog starts without 1:0 tag. 107.882063 TLogRecover ID=f4e1c30d6e2714c3 LogId=d98969121564982f At=553607005 Known=553507005 Unrecovered=553507006 Tags=-1:2,1:2,-7:0 Locality=-1 LogRouterTags=0 CC,CP,RV,SS,TL
SS peeks with its own version, which > recovery version. However, TLog replies with end version 600553607006. 107.965939 SSDDD ID=4d4ec8f9e559696d RecoverAt=553607005 DurableVersion=655685600 Version=655685600 CD,DD,GP,MS,RK,SS,TL 107.965939 TLogPeekSingleNoHistory ID=4d4ec8f9e559696d Tag=1:0 Begin=655685601 CD,DD,GP,MS,RK,SS,TL 107.965939 TLogPeekLocalBestOnly ID=4d4ec8f9e559696d Tag=1:0 Begin=655685601 End=9223372036854775807 BestSet=0 BestSetStart=553507006 LogId=d98969121564982f CD,DD,GP,MS,RK,SS,TL 107.965939 SSDDD ID=4d4ec8f9e559696d DurableVersion=655685600 Version=655685600 CD,DD,GP,MS,RK,SS,TL 107.967472 TLogPeekMessages0 ID=f4e1c30d6e2714c3 ReqBegin=655685601 Tag=1:0 ReturnIfBlocked=0 OnlySpilled=0 CC,CP,RV,SS,TL 107.967472 TLogPeekMessages1 ID=f4e1c30d6e2714c3 ReqBegin=655685601 Tag=1:0 PoppedVer=553607006 CC,CP,RV,SS,TL 107.967472 TLogPeek ID=f4e1c30d6e2714c3 LogId=d98969121564982f Tag=1:0 BeginVer=655685601 EndVer=600553607006 MsgBytes=0 CC,CP,RV,SS,TL
107.967472 TLogPeekMessages1 ID=f4e1c30d6e2714c3 ReqBegin=655685601 Tag=1:0 PoppedVer=553607006 CC,CP,RV,SS,TL Machine="2.1.1.2:1"
107.969459 SSDDD ID=4d4ec8f9e559696d RecoverAt=553607005 DurableVersion=655685600 Version=600553607005 CD,DD,GP,MS,RK,SS,TL
107.969459 TLogPeekSingleNoHistory ID=4d4ec8f9e559696d Tag=1:0 Begin=600553607006 CD,DD,GP,MS,RK,SS,TL
107.969459 TLogPeekLocalBestOnly ID=4d4ec8f9e559696d Tag=1:0 Begin=600553607006 End=9223372036854775807 BestSet=0 BestSetStart=553507006 LogId=d98969121564982f CD,DD,GP,MS,RK,SS,TL
107.969459 SSDDD ID=4d4ec8f9e559696d DurableVersion=655685600 Version=600553607005 CD,DD,GP,MS,RK,SS,TL
107.970327 SSDDD ID=4d4ec8f9e559696d RecoverAt=553607005 DurableVersion=655685600 Version=600553607005 CD,DD,GP,MS,RK,SS,TL
107.970327 TLogPeekSingleNoHistory ID=4d4ec8f9e559696d Tag=1:0 Begin=600553607006 CD,DD,GP,MS,RK,SS,TL
107.970327 TLogPeekLocalBestOnly ID=4d4ec8f9e559696d Tag=1:0 Begin=600553607006 End=9223372036854775807 BestSet=0 BestSetStart=553507006 LogId=d98969121564982f CD,DD,GP,MS,RK,SS,TL
107.970327 SSDDD ID=4d4ec8f9e559696d DurableVersion=655685600 Version=600553607005 CD,DD,GP,MS,RK,SS,TL
107.971342 TLogPeekMessages0 ID=f4e1c30d6e2714c3 ReqBegin=600553607006 Tag=1:0 ReturnIfBlocked=0 OnlySpilled=0 CC,CP,RV,SS,TL
107.971342 TLogPeekMessages0 ID=f4e1c30d6e2714c3 ReqBegin=600553607006 Tag=1:0 ReturnIfBlocked=0 OnlySpilled=0 CC,CP,RV,SS,TL
- SS reached a higher version B.
- Recovery rolled version back to A < B
- SS peeks the new tlog with its begin version B+1, which > tlog's recoveredAt + 1 == A + 1
- Because of poppedVer = A + 1 < reqBegin, we didn't enter this block to set rep.popped. Instead, we return with a higher version C here.
- To SS, the messages are empty, thus no chance of removing itself, and SS keeps peeking.
Fixed in PR #6886
Commit Proxy (CP) stuck in commit:
93.079959 CommitAttachID ID=0000000000000005 To=0bd971baa212eddc TS
93.079959 CommitDebug ID=0bd971baa212eddc Location=NativeAPI.commit.Before TS
93.080949 CommitDebug ID=0bd971baa212eddc Location=CommitProxyServer.batcher CD,CP,SS
93.081455 CommitAttachID ID=0bd971baa212eddc To=2f2b305bb2ef7e96 CD,CP,SS
93.081455 CommitDebug ID=2f2b305bb2ef7e96 Location=CommitProxyServer.commitBatch.Before CD,CP,SS
93.081455 CommitDebug ID=2f2b305bb2ef7e96 Location=CommitProxyServer.commitBatch.GettingCommitVersion CD,CP,SS
93.083533 CommitDebug ID=2f2b305bb2ef7e96 Location=CommitProxyServer.commitBatch.GotCommitVersion CD,CP,SS
93.083630 SlowSSLoopx100 ID=2d100fa4a30020e0 Elapsed=0.0705358 SS,TL
93.085507 CommitAttachID ID=2f2b305bb2ef7e96 To=7538ae1ba8d5f297 RV,SS
93.085507 CommitDebug ID=7538ae1ba8d5f297 Location=Resolver.resolveBatch.Before RV,SS
93.085507 CommitDebug ID=7538ae1ba8d5f297 Location=Resolver.resolveBatch.AfterQueueSizeCheck RV,SS
93.085507 CommitDebug ID=7538ae1ba8d5f297 Location=Resolver.resolveBatch.AfterOrderer RV,SS
93.085507 CommitDebug ID=7538ae1ba8d5f297 Location=Resolver.resolveBatch.After RV,SS
93.086488 CommitDebug ID=2f2b305bb2ef7e96 Location=CommitProxyServer.commitBatch.AfterResolution CD,CP,SS
Time="97.439628" DateTime="2022-04-18T16:23:02Z" Type="ProxyMetrics" Machine="2.0.1.1:1" ID="13a97ed486be5771" Elapsed="5" TxnCommitIn="0.4 0.992447 398" TxnCommitVersionAssigned="0.4 0.991717 398" TxnCommitResolving="0.4 0.991717 398" TxnCommitResolved="0 -1 396" TxnCommitOut="0 -1 396" TxnCommitOutSuccess="0 -1 68" TxnCommitErrors="0 -1 0" TxnConflicts="0 -1 328" TxnRejectedForQueuedTooLong="0 -1 0" CommitBatchIn="0.4 0.970729 270" CommitBatchOut="0 -1 267" MutationBytes="0 -1 65574" Mutations="0 -1 260" ConflictRanges="1 3.97929 6569" KeyServerLocationIn="0 -1 543" KeyServerLocationOut="0 -1 543" KeyServerLocationErrors="0 -1 0" ExpensiveClearCostEstCount="0 -1 0" LastAssignedCommitVersion="76530041" Version="71360611" CommittedVersion="62315900" CommitBatchesMemBytesCount="914" MaxCompute="0" MinCompute="1000000000000" Time="102.439628" DateTime="2022-04-18T16:23:03Z" Type="ProxyMetrics" Machine="2.0.1.1:1" ID="13a97ed486be5771" Elapsed="5" TxnCommitIn="0 -1 398" TxnCommitVersionAssigned="0 -1 398" TxnCommitResolving="0 -1 398" TxnCommitResolved="0 -1 396" TxnCommitOut="0 -1 396" TxnCommitOutSuccess="0 -1 68" TxnCommitErrors="0 -1 0" TxnConflicts="0 -1 328" TxnRejectedForQueuedTooLong="0 -1 0" CommitBatchIn="0 -1 270" CommitBatchOut="0 -1 267" MutationBytes="0 -1 65574" Mutations="0 -1 260" ConflictRanges="0 -1 6569" KeyServerLocationIn="0 -1 543" KeyServerLocationOut="0 -1 543" KeyServerLocationErrors="0 -1 0" ExpensiveClearCostEstCount="0 -1 0" LastAssignedCommitVersion="76530041" Version="71360611" CommittedVersion="62315900" CommitBatchesMemBytesCount="914" MaxCompute="0" MinCompute="1000000000000"
105.167700 CommitProxyTerminated ID=a60979491ef5cdef Error=failed_to_progress ErrorDescription=Process has failed to make sufficient progress ErrorCode=1216 CP,SS
Check CP a60979491ef5cdef, the version stuck at 67825749. Enable mutation tracking for version >= 67825749.
It seems the workload uses wait(timeoutError(tr.commit(), 30.0));
, which caused the test failure. If I just use wait(tr.commit());
, the test passes.
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/restarting/from_7.0.0/SnapTestAttrition-1.txt -s 907803276 -b off
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/AtomicOpsApiCorrectness.toml -s 711988940 -b on
379.460270 RecoveryDelayedTooManyOldGenerations ID=0000000000000000 OldGenerations=71 Reason=Recovery is delayed because too many recoveries have happened since the last time the cluster was fully_recovered.
Recoveries were due to new CC elected, or BetterMasterExists.
Type="SetupAndRunError" Machine="0.0.0.0:0" ID="0000000000000000" Error="timed_out" ErrorDescription="Operation timed out"
137.914204 QuietDatabaseWaitingOnFullRecovery ID=0000000000000000 TS
117.656170 2.1.1.0:1 ID=bc1320301959cb13 0 reading_coordinated_state
117.676096 2.1.1.0:1 ID=bc1320301959cb13 1 locking_coordinated_state TLogs=1 ActiveGenerations=3 MyRecoveryCount=15 ForceRecovery=0
117.715558 2.1.1.0:1 ID=bc1320301959cb13 3 reading_transaction_system_state
117.723328 2.1.1.0:1 ID=bc1320301959cb13 7 recruiting_transaction_servers Conf={"backup_worker_enabled":0,"blob_granules_enabled":0,"log_spill":2,"log_version":7,"perpetual_storage_wiggle":0,"perpetual_storage_wiggle_locality":"data_hall:1","redundancy_mode":"three_data_hall","storage_engine":"memory-radixtree-beta","storage_migration_type":"disabled","tenant_mode":"optional_experimental","usable_regions":1} RequiredCommitProxies=1 RequiredGrvProxies=1 RequiredResolvers=1
117.723328 2.1.1.0:1 ID=bc1320301959cb13 8 initializing_transaction_servers CommitProxies=1 GrvProxies=1 TLogs=4 Resolvers=1 SatelliteTLogs=0 OldLogRouters=0 StorageServers=0 BackupWorkers=0 PrimaryDcIds= RemoteDcIds=
118.328984 2.1.1.0:1 ID=bc1320301959cb13 9 recovery_transaction PrimaryLocality=-1 DcId=1 ClusterId=21702ee4b56eb66d
118.347734 2.1.1.0:1 ID=bc1320301959cb13 10 writing_coordinated_state TLogList=0: 431ef21e0dee2c48eabf44a0993ecfac, f3b0175d623748d171378a01be606b6c, 6a0be6966704eb9c93766110fd9b9cb2, 91f2b61ee2665554b6cf1fdfd76e7017
118.379219 2.1.1.0:1 ID=bc1320301959cb13 11 accepting_commits StoreType=memory-radixtree-beta RecoveryDuration=0.723048
118.379219 2.1.1.0:1 ID=bc1320301959cb13 12 all_logs_recruited
DB was not fully recovered, and SnapTest requires fully recovered state.
117.723328 MasterRecoveredConfig ID=bc1320301959cb13 LastEpochEnd=595644423 RecoveryTransactionVersion=695644423 CC,DD,MS,RK
117.723328 MasterRecoveredConfig ID=bc1320301959cb13 Conf={backup_worker_enabled:0,blob_granules_enabled:0,log_spill:2,log_version:7,perpetu
al_storage_wiggle:0,perpetual_storage_wiggle_locality:data_hall:1,redundancy_mode:three_data_hall,storage_engine:memory-radixtree-beta,storage_migration_type:disabled,tenant_mode:optional_experimental,usable_regions:1} CC,DD,MS,RK
76.692883 TrackTlogRecovery ID=0000000000000000 FinalUpdate=1 NewState.tlogs=1 NewState.OldTLogs=0 Expected.tlogs=1 CC,CP,DD,GP,MS,RK,RV
104.550269 TrackTlogRecovery ID=0000000000000000 FinalUpdate=0 NewState.tlogs=1 NewState.OldTLogs=1 Expected.tlogs=1 CC,CP,RV,SS,TL
109.358887 TrackTlogRecovery ID=0000000000000000 FinalUpdate=0 NewState.tlogs=1 NewState.OldTLogs=2 Expected.tlogs=1 CC,CP,GP,MS,RV
118.347734 TrackTlogRecovery ID=0000000000000000 FinalUpdate=0 NewState.tlogs=1 NewState.OldTLogs=3 Expected.tlogs=1 CC,CP,DD,GP,MS,RK,RV
TLog unpopped tag -1:0 version 0
5399.511351 TLogMetrics ID=a2c7044da829aa18 Elapsed=5 BytesInput=0 -1 55279822 BytesDurable=0 -1 0 BlockingPeeks=0 -1 0 BlockingPeekTimeouts=0 -1 0 EmptyPeeks=0 -1 0 NonEmptyPeeks=0 -1 0 Version=488316341 QueueCommittedVersion=488316341 PersistentDataVersion=385611487 PersistentDataDurableVersion=385611487 KnownCommittedVersion=488272049 QueuePoppedVersion=0 MinPoppedTagVersion=0 MinPoppedTagLocality=-1 MinPoppedTagId=0 SharedBytesInput=55279822 SharedBytesDurable=0 SharedOverheadBytesInput=29143525 SharedOverheadBytesDurable=0 PeekMemoryReserved=0 PeekMemoryRequestsStalled=0 Generation=11
- init( PEEK_USING_STREAMING, true ); if( randomize && BUGGIFY ) PEEK_USING_STREAMING = false;
+ init( PEEK_USING_STREAMING, false );
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/KillRegionCycle.toml -s 138443257 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/restarting/from_7.0.0/SnapTestAttrition-1.txt -s 73141792 -b off
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/restarting/from_7.0.0/SnapTestAttrition-1.txt -s 123231796 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/DataLossRecovery.toml -s 820266129 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/TenantManagement.toml -s 935988434 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/BlobGranuleVerifySmallClean.toml -s 934167474 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/KillRegionCycle.toml -s 783190175 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ApiCorrectness.toml -s 674539537 -b on
Looks like infinite loop at getConsistentReadVersion
, where isCurrentGrvProxy()
returns false.
So the clientInfo only has one proxy, and the GRV is sent to the other proxy. They don't match.
Time="119.310459" DateTime="2022-04-16T01:08:56Z" Type="TransactionAttachID" ID="0000000000000005" To="6bf56a6f2174759c" Machine="3.4.3.5:1" LogGroup="default" Roles="TS" Time="119.311561" Type="BGV_Current" Machine="3.4.3.5:1" ID="941c41791a208b9f" ClientInfoProxy0Of1="d0753cdbbf6a67e1"
shrinkProxyList
set proxy to 1 == MAX_GRV_PROXY_CONNECTIONS
:
54.739141 ConnectToDatabase ID=e6e05a6833aba40b Version=7.1.2 ClusterFile=file://simfdb/1359984f07a7e298058ba407ca310b19/fdb.cluster ConnectionString=TestCluster:0@fakeCoordinatorDC0M0:1,fakeCoordinatorDC0M1:1,fakeCoordinatorDC0M2:1,fakeCoordinatorDC1M0:1,fakeCoordinatorDC2M0:1,fakeCoordinatorDC2M1:1,fakeCoordinatorDC3M0:1,fakeCoordinatorDC3M1:1 ClientLibrary=/root/fdbserver Primary=1 Internal=1 TS
54.744823 ConnectedGrvProxy ID=0000000000000000 GrvProxy=393d1558a2df4419 TS
56.092533 ConnectedGrvProxy ID=0000000000000000 GrvProxy=941c41791a208b9f TS
56.094075 ConnectedGrvProxy ID=0000000000000000 GrvProxy=941c41791a208b9f TS
108.590850 ConnectedGrvProxy ID=0000000000000000 GrvProxy=393d1558a2df4419 TS
108.595318 ConnectedGrvProxy ID=0000000000000000 GrvProxy=d0753cdbbf6a67e1 TS
108.595318 ConnectedGrvProxy ID=0000000000000000 GrvProxy0-1=d0753cdbbf6a67e1 ClientInfoID=ea1ec7f11d54d08b TS
shrinkProxyList(ni, lastCommitProxyUIDs, lastCommitProxies, lastGrvProxyUIDs, lastGrvProxies);
clientInfo->set(ni);
Because the ClientDBInfo::operator==
is defined to be comparing id == r.id
. The shrinking doesn't change the ni.id
, thus clientInfo->set(ni);
can have no effect if a different set of GRV proxies are used, which is not a big problem before. However, when we are comparing GRV response with proxy ID, that ID can be from one not in the clientInfo
, because of the update is not performed.
10.902667 GrvProxyServerCore ID=1dc3e5c8231f97e7 MasterId=654ae078a73f1793 MasterLifetime=87a2d43b2d62d095#1 RecoveryCount=2 CD,CP,GP
10.902904 GrvProxyServerCore ID=436d05c5045a492f MasterId=654ae078a73f1793 MasterLifetime=87a2d43b2d62d095#1 RecoveryCount=2 CD,GP,MS,SS
12.447295 GrvProxyServerCore ID=393d1558a2df4419 MasterId=d57f4c8dd325be4b MasterLifetime=87a2d43b2d62d095#2 RecoveryCount=3 CD,CP,GP,SS,TL
12.448669 GrvProxyServerCore ID=d9f44709a24351b9 MasterId=d57f4c8dd325be4b MasterLifetime=87a2d43b2d62d095#2 RecoveryCount=3 CD,GP,MS,TL
55.925215 GrvProxyServerCore ID=941c41791a208b9f MasterId=d2b096831eb93b25 MasterLifetime=87a2d43b2d62d095#3 RecoveryCount=0 BK,CC,CD,GP,RV,SS,TL
55.927475 GrvProxyServerCore ID=d0753cdbbf6a67e1 MasterId=d2b096831eb93b25 MasterLifetime=87a2d43b2d62d095#3 RecoveryCount=5 CD,CP,DD,GP,RK,
cmr | egrep 'ConsistencyCheck|QuietDatabase|MaxTLogQueue|MaxStorageServer' | s | less
19543.987412 MaxTLogQueueSize ID=0000000000000000 Stage=ContactingLogs
19543.993452 MaxTLogQueueSize ID=0000000000000000 Stage=ComputingMax MessageCount=5
19544.006081 QuietDatabaseGotMaxStorageServerQueueSize ID=0000000000000000 Stage=MaxComputed Max=6727010 MaxQueueServer=ccab656c077b9888
19544.006081 QuietDatabaseEnd ID=0000000000000000 DataInFlight=0 DataInFlightGate=0 MaxTLogQueueSize=0 MaxTLogQueueGate=2000000 MaxTLogPoppedVersionLag=19341961249 MaxTLogPoppedVersionLagGate=30000000 DataDistributionQueueSize=1 DataDistributionQueueSizeGate=0 TeamCollectionValid=1 MaxStorageQueueSize=6727010 MaxStorageServerQueueGate=2000000 DataDistributionActive=1 StorageServersRecruiting=0 RecoveryCount=2 VersionOffset=0 NumSuccesses=0
MaxTlogQueue: 19341961249 > 30m, MaxStorageQueueSize: 6727010 > 2m SS ID=ccab656c077b9888 Tag=1:1
Enable TLogPopDetails
577.039362 TLogPopDetails ID=c390930570cbb5e4 Tag=0:1 UpTo=552477014 PoppedVersionLag=428477260 MinPoppedTag=-2:0 QueuePoppedVersion=125931440 UnpoppedRecovered=False NothingPersistent=True
Check tag -2:0, we have backup worker & log router 14.473551 LogRouterStart ID=79e8e8ac660cbcd1 Start=1 Tag=-2:0 Localities=2 Locality=1 LR 23.281830 BackupWorkerStart ID=8619226c98a12ba7 Tag=-2:0 TotalTags=3 StartVersion=1 EndVersion=-1 LogEpoch=2 BackupEpoch=2
LogRouter was not popping, stuck at PoppedVersion=125931440 414.473551 LogRouterMetrics ID=79e8e8ac660cbcd1 Elapsed=5 GetMoreCount=0 -1 263 GetMoreBlockedCount=0 -1 263 Version=126995632 MinPopped=125995632 FetchedVersions=1000000 MinKnownCommittedVersion=126131881 PoppedVersion=125931440 FoundEpochEnd=1 WaitForVersionMS=0 WaitForVersionMaxMS=0 GetMoreMS=0 GetMoreMaxMS=0 Generation=2 ActivePeekStreams=2 PrimaryPeekLocation=4e43b72934b37b91 RouterTag=-2:0 ... 8019.473551 LogRouterMetrics ID=79e8e8ac660cbcd1 Elapsed=5 GetMoreCount=0 -1 263 GetMoreBlockedCount=0 -1 263 Version=126995632 MinPopped=125995632 FetchedVersions=1000000 MinKnownCommittedVersion=126131881 PoppedVersion=125931440 FoundEpochEnd=1 WaitForVersionMS=0 WaitForVersionMaxMS=0 GetMoreMS=0 GetMoreMaxMS=0 Generation=2 ActivePeekStreams=2 PrimaryPeekLocation=4e43b72934b37b91 RouterTag=-2:0
Looks like remote tlog is not popping log routers.
startingConfiguration:new backup_worker_enabled:=1 blob_granules_enabled:=0 grv_proxies:=2 log_spill:=2 log_version:=7 logs:=4 perpetual_storage_wiggle:=0 proxies:=2 double remote_single resolvers:=2 ssd-2 storage_migration_type=disabled tenant_mode=required_experimental usable_regions:=2 regions=[{"datacenters":[{"id":"0","priority":2},{"id":"2","priority":1,"satellite":1}]},{"datacenters":[{"id":"1","priority":1},{"id":"3","priority":1,"satellite":1,"satellite_logs":5}]}] start
DC ID 1 is the remote.
cmr | egrep 'TLogMetrics|TLogPop|PeekMetrics' | grep 9fee2fe60467c89c | s | less
8019.488211 TLogMetrics ID=9fee2fe60467c89c Elapsed=5 BytesInput=0 -1 13226145 BytesDurable=0 -1 13226145 BlockingPeeks=0 -1 0 BlockingPeekTimeouts=0 -1 0 EmptyPeeks=0 -1 0 NonEmptyPeeks=0 -1 0 Version=125995632 QueueCommittedVersion=125995632 PersistentDataVersion=125995632 PersistentDataDurableVersion=125995632 KnownCommittedVersion=125931440 QueuePoppedVersion=123638158 MinPoppedTagVersion=123638158 MinPoppedTagLocality=1 MinPoppedTagId=2
Looks like this TLog version is stuck at 125995632
. However, log routers have higher versions.
BufferedCursor 23.296039 TLogPeekRemoteBestOnly ID=9fee2fe60467c89c Tag=-3:1 Begin=2176502 End=9223372036854775807 BestSet=1 BestSetStart=1 LogRouterIds=79e8e8ac660cbcd11b83de37738eed25, 801bffeca9f9df2eb9331c5f6a8360a6, 21c4015c5335022160f5c34ca0d9f4e6 LR,SS,TL 23.296039 SPC_BGM0 ID=085d6b258b023c2e MessageVersion=2176502 End=9223372036854775807 LR,SS,TL 23.296039 SPC_StreamCreated ID=0ec68a4a745a9103 Tag=-3:1 ReturnIfBlocked=0 PeerAddr=2.1.1.1:1 PeerToken=ae2ea78c938a1a99 LR,SS,TL 23.296039 SPC_StreamCreated ID=7abe8962cb84468a Tag=-3:1 ReturnIfBlocked=0 PeerAddr=2.1.1.0:1 PeerToken=1e5b9299bbb9855b LR,SS,TL 23.296039 SPC_StreamCreated ID=5548f53bfe15ff45 Tag=-3:1 ReturnIfBlocked=0 PeerAddr=2.1.1.2:1 PeerToken=0524fe607c1a32eb LR,SS,TL
146.836426 SPC_BGM0 ID=c2ce743cacff33e3 MessageVersion=126217254 End=9223372036854775807 LR,SS,TL 146.836426 SPC_GetMore ID=a177ea68b485aa67 HasMessage=0 More=0 MessageVersion=126217254.0 End=9223372036854775807.0 LR,SS,TL 146.836426 SPC_GetMore ID=cc6c43e607613e05 HasMessage=0 More=0 MessageVersion=126217254.0 End=9223372036854775807.0 LR,SS,TL 146.836426 SPC_GetMore ID=5ef99fabb6ed6bea HasMessage=0 More=0 MessageVersion=126217254.0 End=9223372036854775807.0 LR,SS,TL
cmr | egrep '9fee2fe60467c89c|TLogPeekRemoteBestOnly|SPC_StreamCreated|085d6b258b023c2e|0ec68a4a745a9103|7abe8962cb84468a|5548f53bfe15ff45|LogRouterPeek.*21c4015c53350221|LogRouterPeekPopped' | s | egrep -v 'TLogVersion0After|SPC_NextMessage|Tag=-3:0' | less
the bug in how log routers handles streaming read:
- Log router has a
logRouterPeekStream
actor A running. - Remote tlog detects some problem and starts another streaming connection (maybe just reuse the connection?)
- Log router now has a new
logRouterPeekStream
actor B running. - B runs and found that popped version > reqBegin, so
LogRouterPeekPopped
. This is because A is still running and changed the popped version. - A ends with
TLogPeekStreamEnd operation_obsolete
- B become stuck at
wait(req.reply.onReady() && store(reply.rep, future)
, because the future was sentNever()
. The fix may be cache logRouterPeekStream actors so that we'll cancel old ones if a new streaming request comes in
146.839332 LogRouterPeekStream ID=21c4015c53350221 Token=0524fe607c1a32eb LR,SS 146.839332 LogRouterPeek1 ID=21c4015c53350221 Ver=126217253 Begin=125978220 Tag=-3:1 LR,SS 146.839332 LogRouterPeek4 ID=21c4015c53350221 Tag=-3:1 Begin=125978220 ReplyMaxKnownVersion=126217253 EndVersion=125995307 Popped=125931440 SeqPresent=0 MinKCV=124995632 LR,SS 146.839332 LogRouterPeek1 ID=21c4015c53350221 Ver=126217253 Begin=125995307 Tag=-3:1 LR,SS 146.839332 LogRouterPeekPopped ID=21c4015c53350221 Begin=125995307 Popped=125995632 Start=1 Tag=-3:1 LR,SS 146.842813 SPC_GetMoreB ID=5548f53bfe15ff45 Has=0 End=125995307 Popped=125931440 LR,SS,TL 146.842813 SPC_GetMore ID=5548f53bfe15ff45 HasMessage=0 More=1 MessageVersion=125995633.0 End=9223372036854775807.0 LR,SS,TL 147.189569 SPC_GetMoreB ID=0ec68a4a745a9103 Has=0 End=126931441 Popped=125931440 LR,SS,TL 147.189569 SPC_GetMore ID=0ec68a4a745a9103 HasMessage=0 More=1 MessageVersion=126931441.0 End=9223372036854775807.0 LR,SS,TL 147.192539 LogRouterPeek4 ID=21c4015c53350221 Tag=-3:1 Begin=126217254 ReplyMaxKnownVersion=126931440 EndVersion=126931441 Popped=125931440 SeqPresent=0 MinKCV=124995632 LR,SS 147.193039 LogRouterPeek1 ID=21c4015c53350221 Ver=126931440 Begin=126931441 Tag=-3:1 LR,SS 147.193039 TLogPeekStreamEnd ID=21c4015c53350221 Error=operation_obsolete ErrorDescription=Operation result no longer necessary ErrorCode=1002 PeerAddr=2.1.1.1:1 LR,SS
146.842813 SPC_GetMoreB ID=5548f53bfe15ff45 Has=0 End=125995307 Popped=125931440 LR,SS,TL 146.842813 SPC_GetMore ID=5548f53bfe15ff45 HasMessage=0 More=1 MessageVersion=125995633.0 End=9223372036854775807.0 LR,SS,TL ^^^^ 1M below the other two 147.537550 SPC_GetMore ID=7abe8962cb84468a HasMessage=0 More=1 MessageVersion=126995633.0 End=9223372036854775807.0 LR,SS,TL 147.537585 SPC_GetMore ID=0ec68a4a745a9103 HasMessage=0 More=1 MessageVersion=126995633.0 End=9223372036854775807.0 LR,SS,TL
Checked log routers have returned higher versions: 147.535274 LogRouterPeek4 ID=801bffeca9f9df2e Tag=-3:1 ReplyMaxKnownVersion=126995632 EndVersion=126995633 Popped=125995632 MinKCV=125931440 LR,SS,TL 147.535274 LogRouterPeek4 ID=801bffeca9f9df2e Tag=-3:0 ReplyMaxKnownVersion=126995632 EndVersion=126995633 Popped=125995632 MinKCV=125931440 LR,SS,TL 147.536561 LogRouterPeek4 ID=21c4015c53350221 Tag=-3:0 ReplyMaxKnownVersion=126995632 EndVersion=126995633 Popped=125995632 MinKCV=124995632 LR,SS ^^^ missing -3:1 for this router 147.537585 LogRouterPeek4 ID=79e8e8ac660cbcd1 Tag=-3:1 ReplyMaxKnownVersion=126995632 EndVersion=126995633 Popped=125995632 MinKCV=124995632 LR,SS,TL 147.537585 LogRouterPeek4 ID=79e8e8ac660cbcd1 Tag=-3:0 ReplyMaxKnownVersion=126995632 EndVersion=126995633 Popped=125995632 MinKCV=124995632 LR,SS,TL
cmr | egrep 'TLogMetrics|TLogPop|PeekMetrics' | grep 8f57933a091a9efd | s | less
8019.489985 TLogMetrics ID=8f57933a091a9efd Elapsed=5 BytesInput=0 -1 16847748 BytesDurable=0 -1 16847748 BlockingPeeks=0 -1 0 BlockingPeekTimeouts=0 -1 0 EmptyPeeks=0 -1 0 NonEmptyPeeks=0 -1 0 Version=126995632 QueueCommittedVersion=126995632 PersistentDataVersion=126995632 PersistentDataDurableVersion=126995632 KnownCommittedVersion=125995632 QueuePoppedVersion=123638158 MinPoppedTagVersion=123638158 MinPoppedTagLocality=1 MinPoppedTagId=2
Both remote tlogs are similar in versions. Checking remote SS
147.664701 StorageMetrics ID=ccab656c077b9888 BytesInput=6.29613e+06 8.43277e+06 34352914 BytesDurable=1.84661e+06 8.52027e+06 9234278 11297.664701 StorageMetrics ID=ccab656c077b9888 BytesInput=0 -1 34352914 BytesDurable=0 -1 27625904
updateStorage()
stuck at wait(data->desiredOldestVersion.whenAtLeast(data->storageVersion() + 1));
150.699907 StorageServerDurable0 ID=ccab656c077b9888 DesiredOldestVersion=124995632 StorageVersion=124995632 LR,SS,TL
Last update()
loop was at:
146.839289 StorageServerUpdated ID=ccab656c077b9888 Ver=125995632 DataVersion=125995632 LastTLogVersion=125995632 NewOldest=124236966 DesiredOldest=124978220 MaxVersionInMemory=1000000 Proposed=124995632 PrimaryLocality=-99 Tag=1:1 LR,SS,TL
And then stuck at wait(cursor->getMore());
.
35999.469845 TLogMetrics ID=c390930570cbb5e4 Elapsed=5 BytesInput=0 -1 64769121 BytesDurable=0 -1 64769121 BlockingPeeks=0 -1 0 BlockingPeekTimeouts=0 -1 0 EmptyPeeks=0 -1 0 NonEmptyPeeks=0 -1 0 Version=35885844132 QueueCommittedVersion=35885844132 PersistentDataVersion=35885844132 PersistentDataDurableVersion=35885844132 KnownCommittedVersion=35884882437 QueuePoppedVersion=125931440 PersistentDataDurableVersion=35885844132 >>> QueuePoppedVersion=125931440
cmr | egrep 'DDRecruiting|cc45321354cea0ab|aef4e032dcc57d5b' | s | less
consistency check timeout, SS recruiting DDRecruiting ID=0000000000000000 Primary=1 State=Sending request to worker WorkerID=cc45321354cea0ab AttemptedDoubleRecruitment
99.480836 2.0.1.0:1 ID=32bbbe1c29d98b97 11 accepting_commits StoreType=ssd-redwood-1-experimental RecoveryDuration=1.01329 105.149200 2.1.1.0:1 ID=462f68f756db43e9 0 reading_coordinated_state 105.154920 2.1.1.0:1 ID=462f68f756db43e9 1 locking_coordinated_state TLogs=2 ActiveGenerations=5 MyRecoveryCount=17 ForceRecovery=0 110.117912 2.1.1.0:1 ID=4a3464d668fb5415 0 reading_coordinated_state 110.133373 2.1.1.0:1 ID=4a3464d668fb5415 1 locking_coordinated_state TLogs=2 ActiveGenerations=5 MyRecoveryCount=18 ForceRecovery=1 110.158240 2.1.1.0:1 ID=4a3464d668fb5415 3 reading_transaction_system_state 110.161464 2.1.1.0:1 ID=4a3464d668fb5415 7 recruiting_transaction_servers Conf={"backup_worker_enabled":0,"blob_granules_enabled":0,"log_spill":2,"log_version":7,"logs":3,"perpetual_storage_wiggle":1,"perpetual_storage_wiggle_locality":"0","redundancy_mode":"single","regions":[{"datacenters":[{"id":"1","priority":1},{"id":"3","priority":1,"satellite":1},{"id":"5","priority":1,"satellite":1}],"satellite_logs":4,"satellite_redundancy_mode":"two_satellite_fast"},{"datacenters":[{"id":"0","priority":-1},{"id":"2","priority":1,"satellite":1},{"id":"4","priority":1,"satelli... RequiredCommitProxies=1 RequiredGrvProxies=1 RequiredResolvers=1 110.161464 2.1.1.0:1 ID=4a3464d668fb5415 8 initializing_transaction_servers CommitProxies=2 GrvProxies=1 TLogs=2 Resolvers=1 SatelliteTLogs=0 OldLogRouters=1 StorageServers=0 BackupWorkers=0 PrimaryDcIds=1 RemoteDcIds=0 110.188493 2.1.1.0:1 ID=4a3464d668fb5415 9 recovery_transaction PrimaryLocality=-1 DcId=1 ClusterId=0d09ed8e9fc97fc2 110.306383 2.1.1.0:1 ID=4a3464d668fb5415 10 writing_coordinated_state TLogList=0: f20d2c2ba19f3c33097149b9e3da80fc, 66fadcdbed98c30194761c0a7007e0b2 110.432635 2.1.1.0:1 ID=4a3464d668fb5415 11 accepting_commits StoreType=ssd-redwood-1-experimental RecoveryDuration=0.314723 110.432635 2.1.1.0:1 ID=4a3464d668fb5415 12 all_logs_recruited 117.123451 2.1.1.0:1 ID=4a3464d668fb5415 14 fully_recovered ClusterId=0d09ed8e9fc97fc2 134.006397 2.1.1.0:1 ID=975b00cf2581fdb3 0 reading_coordinated_state 134.006597 2.1.1.0:1 ID=975b00cf2581fdb3 1 locking_coordinated_state TLogs=1 ActiveGenerations=1 MyRecoveryCount=20 ForceRecovery=0 134.494587 2.1.1.0:1 ID=975b00cf2581fdb3 3 reading_transaction_system_state 134.499824 2.1.1.0:1 ID=975b00cf2581fdb3 7 recruiting_transaction_servers Conf={"backup_worker_enabled":0,"blob_granules_enabled":0,"log_spill":2,"log_version":7,"logs":3,"perpetual_storage_wiggle":0,"perpetual_storage_wiggle_locality":"0","redundancy_mode":"single","regions":[{"datacenters":[{"id":"1","priority":1},{"id":"3","priority":1,"satellite":1},{"id":"5","priority":1,"satellite":1}],"satellite_logs":4,"satellite_redundancy_mode":"two_satellite_fast"},{"datacenters":[{"id":"0","priority":-1},{"id":"2","priority":1,"satellite":1},{"id":"4","priority":1,"satelli... RequiredCommitProxies=1 RequiredGrvProxies=1 RequiredResolvers=1 134.499824 2.1.1.0:1 ID=975b00cf2581fdb3 8 initializing_transaction_servers CommitProxies=2 GrvProxies=1 TLogs=2 Resolvers=1 SatelliteTLogs=0 OldLogRouters=0 StorageServers=0 BackupWorkers=0 PrimaryDcIds=1 RemoteDcIds=0 134.509373 2.1.1.0:1 ID=975b00cf2581fdb3 9 recovery_transaction PrimaryLocality=-1 DcId=1 ClusterId=0d09ed8e9fc97fc2 134.630719 2.1.1.0:1 ID=975b00cf2581fdb3 10 writing_coordinated_state TLogList=0: 8e5b0d45bc35bc700ed94555c8585959, 25c4e1d245dcd074e19a66a7e0225d6a 135.213083 2.1.1.0:1 ID=975b00cf2581fdb3 11 accepting_commits StoreType=ssd-redwood-1-experimental RecoveryDuration=1.20669 135.213083 2.1.1.0:1 ID=975b00cf2581fdb3 12 all_logs_recruited 150.285496 2.1.1.0:1 ID=975b00cf2581fdb3 14 fully_recovered ClusterId=0d09ed8e9fc97fc2
117.709828 AttemptedDoubleRecruitment ID=cc45321354cea0ab ForRole=StorageServer CC,CD,CP,RV,SS,TL
87.550635 ServerTagRemove ID=0000000000000000 PopVersion=268076943 Tag=1:0 Server=aef4e032dcc57d5b CC,CP,RV,SS,TL
87.561902 SSPrivateMutation ID=aef4e032dcc57d5b Mutation=code: ClearRange param1: \xff\xff/serverTag/[}\xc5\xdc2\xe0\xf4\xae\x842\x04K\xaev\xff\xee param2: \xff\xff/serverTag/[}\xc5\xdc2\xe0\xf4\xae\x842\x04K\xaev\xff\xee\x00 Version=268076942 CD,LR,SS,TL
87.561902 StorageServerWorkerRemoved ID=aef4e032dcc57d5b Reason=ServerTag MutationType=ClearRange TagMatches=1 IsTSS=0 CD,LR,SS,TL
87.561902 RedwoodShutdown ID=0000000000000000 Filename=/root/tmp/simfdb/21e91cf55df38898163ba08045adc0ee/storage-aef4e032dcc57d5beeff76ae4b043284.redwood-v1 Dispose=1 CD,LR,SS,TL
87.561902 StorageServerTerminated ID=aef4e032dcc57d5b Error=worker_removed ErrorDescription=Normal worker shut down ErrorCode=1202 CD,LR,SS,TL
87.561902 WorkerTerminatingByIOError ID=aef4e032dcc57d5b Error=operation_cancelled ErrorDescription=Asynchronous operation cancelled ErrorCode=1101 CD,LR,SS,TL
87.561902 TrackingRunningStorageDesctruction ID=0000000000000000 StorageID=aef4e032dcc57d5b CD,LR,SS,TL
87.561902 Role ID=aef4e032dcc57d5b Error=operation_cancelled ErrorDescription=Asynchronous operation cancelled ErrorCode=1101 Transition=End As=StorageServer Reason=Error CD,LR,SS,TL
87.568128 RemovedStorageServer ID=95624f251a52ee81 ServerID=aef4e032dcc57d5b CP,DD,GP,LR,MS,RK,SS,TL
87.131736 2.0.1.1:1 ID=284572c1d78bcf3f 9 recovery_transaction PrimaryLocality=0 DcId=0 ClusterId=0d09ed8e9fc97fc2 87.262648 2.0.1.1:1 ID=284572c1d78bcf3f 10 writing_coordinated_state TLogList=0: 43e2d1783394b9c05c49b34220ece7b0, 4b69fb14aa4808ace50a408f40a2b41e 1: 46f4d4c7eaee6aba7d50102f6fed9140, 91757d5df86bbaac85d459485356f1e2, e35545082a4af66df503077d18f934a6, 7f80c00d350a02cd079624ec0df20848 2: 433e3701ada24904c3106c6fe1a6de94, c899195a4abb4dc0c77304b02d04a9c1 87.303309 2.0.1.1:1 ID=284572c1d78bcf3f 11 accepting_commits StoreType=ssd-redwood-1-experimental RecoveryDuration=0.290633 87.303309 2.0.1.1:1 ID=284572c1d78bcf3f 12 all_logs_recruited 88.543115 2.0.1.1:1 ID=f110ae1175eab357 0 reading_coordinated_state 88.564911 2.0.1.1:1 ID=f110ae1175eab357 1 locking_coordinated_state TLogs=3 ActiveGenerations=2 MyRecoveryCount=9 ForceRecovery=0 88.596852 2.0.1.1:1 ID=f110ae1175eab357 3 reading_transaction_system_state 88.602874 2.0.1.1:1 ID=f110ae1175eab357 7 recruiting_transaction_servers Conf={"backup_worker_enabled":0,"blob_granules_enabled":0,"log_spill":2,"log_version":7,"logs":3,"perpetual_storage_wiggle":1,"perpetual_storage_wiggle_locality":"0","redundancy_mode":"single","regions":[{"datacenters":[{"id":"1","priority":1},{"id":"3","priority":1,"satellite":1},{"id":"5","priority":1,"satellite":1}],"satellite_logs":4,"satellite_redundancy_mode":"two_satellite_fast"},{"datacenters":[{"id":"0","priority":-1},{"id":"2","priority":1,"satellite":1},{"id":"4","priority":1,"satelli... RequiredCommitProxies=1 RequiredGrvProxies=1 RequiredResolvers=1 89.099797 2.1.1.1:1 ID=03cb70a7a5bce939 0 reading_coordinated_state 89.122679 2.1.1.1:1 ID=03cb70a7a5bce939 1 locking_coordinated_state TLogs=3 ActiveGenerations=2 MyRecoveryCount=10 ForceRecovery=0 89.153685 2.1.1.1:1 ID=03cb70a7a5bce939 3 reading_transaction_system_state 89.548237 2.0.1.1:1 ID=f19e17e0a689447d 0 reading_coordinated_state 90.157984 2.1.1.1:1 ID=03cb70a7a5bce939 7 recruiting_transaction_servers Conf={"backup_worker_enabled":0,"blob_granules_enabled":0,"log_spill":2,"log_version":7,"logs":3,"perpetual_storage_wiggle":1,"perpetual_storage_wiggle_locality":"0","redundancy_mode":"single","regions":[{"datacenters":[{"id":"1","priority":1},{"id":"3","priority":1,"satellite":1},{"id":"5","priority":1,"satellite":1}],"satellite_logs":4,"satellite_redundancy_mode":"two_satellite_fast"},{"datacenters":[{"id":"0","priority":-1},{"id":"2","priority":1,"satellite":1},{"id":"4","priority":1,"satelli... RequiredCommitProxies=1 RequiredGrvProxies=1 RequiredResolvers=1 90.612984 2.1.1.1:1 ID=03cb70a7a5bce939 8 initializing_transaction_servers CommitProxies=2 GrvProxies=1 TLogs=2 Resolvers=1 SatelliteTLogs=4 OldLogRouters=1 StorageServers=0 BackupWorkers=0 PrimaryDcIds=1 RemoteDcIds=0 91.058777 2.1.1.1:1 ID=03cb70a7a5bce939 9 recovery_transaction PrimaryLocality=1 DcId=1 ClusterId=0d09ed8e9fc97fc2 91.193007 2.1.1.1:1 ID=03cb70a7a5bce939 10 writing_coordinated_state TLogList=0: 72e352fcdd623a08e0cd5e9a752a4c56, 75afcc339217a3210eedb2d6adf4b093 1: 958dce0865c18b40a7a363bda17edb69, 40b0bb960f0d1f318edc69b6fc17755a, be7bcb3bc2d6f037721de14b88d2efbc, 4d8544a0b50bc0397626f49f08a0f6f1 91.225796 2.1.1.1:1 ID=03cb70a7a5bce939 11 accepting_commits StoreType=ssd-redwood-1-experimental RecoveryDuration=2.126
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/RyowCorrectness.toml -s 23106879 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/SwizzledRollbackSideband.toml -s 120488561 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/restarting/from_7.0.0/SnapTestAttrition-1.txt -s 374687110 -b on
commit: 01234720e7ce2ee117603964a66d68ab1ffc28dc ./build_output/bin/fdbserver -r simulation --crash --logsize 1024MB -f foundationdb/tests/fast/FuzzApiCorrectness.toml -s 105378906 -b on --crash
Time="21086.505953" Type="ConsistencyCheck_IncorrectEstimate" Machine="3.4.3.3:1" ID="0000000000000000" EstimatedBytes="2525750" CorrectSampledBytes="450389636" StorageServer="738df423f9b09e78"