7.1.2 OOM
OOM: caused by reading disk files (`*.fdq` files are disk queues, `*.sqlite` files are for spilling)
Data points:
- Total 2706 files, ~73GB in size
- Many traces of `HugeArenaSample` point to reading the disk queue (`DiskQueue::readNext`)
- TLog creates one actor for each file and reads them in parallel

No `RecoveryDelayedTooManyOldGenerations` event, so < 75 total recoveries.
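The file count and total size above can be re-checked on a copy of the data directory with a small inventory script. This is a minimal sketch: the function name `summarize_tlog_files` is hypothetical, and it assumes the files of interest are identified purely by the `.fdq` and `.sqlite` suffixes mentioned in these notes.

```python
import os
from collections import defaultdict


def summarize_tlog_files(data_dir, suffixes=(".fdq", ".sqlite")):
    """Return {suffix: (file_count, total_bytes)} for files under data_dir.

    Assumption: disk queue and spill files are recognized only by suffix.
    """
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            suffix = os.path.splitext(name)[1]
            if suffix in suffixes:
                counts[suffix] += 1
                sizes[suffix] += os.path.getsize(os.path.join(root, name))
    return {s: (counts[s], sizes[s]) for s in suffixes}
```

Calling `summarize_tlog_files("/path/to/copied/data")` returns a count and byte total per suffix, which can be compared against the 2706 files and ~73GB noted above.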
Shared TLog should reuse the same file. Q: after reboot, is the same disk queue file used?
Q: Why so many queues and sqlite files?
Q: Why did so many recruitments happen? Does the CC repeatedly recruit TLogs?
Q: Does each TLog recovery create new files? Yes. Shared TLog is one file, but could be multiple files when configure
Q: Is there any data corruption due to serialization that alters TLog data?
SQLite: 2GB cache. Disk queue: read all files; 2700 actors read all files.
Approaches:
- Test the 7.1.0rc3 -> 7.1.2 upgrade path on another cluster
- Copy the files out and use a customized `fdbserver` binary to inspect the content
- If only one TLog has the issue, exclude that one
Severity="40" ErrorKind="Unset" Time="1650658233.310241" DateTime="2022-04-22T20:10:33Z" Type="OutOfMemory" ID="0000000000000000" Message="Out of memory" ThreadID="9529080063715037305" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x35f4abc 0x35f3740 0x35f3b3b 0x35be9fc 0x35bea2c 0x35d4cd5 0x11981f0 0x358f7c2 0xa420a9 0x7faf66d0e555" Machine="10.121.155.32:4703"
Event Severity="10" Time="1650658233.306716" DateTime="2022-04-22T20:10:33Z" Type="TLogPersistentStateRestore" ID="20a05e80a6d1232c" LogId="08c48f9c782166d8" Ver="182955804967008" RecoveryCount="5616" ThreadID="9529080063715037305" Machine="10.121.155.32:4703" LogGroup="playstation_prodfdbserver_p01" Roles="TL"
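Trace events like the two above are flat lists of `Key="value"` attributes, so they can be pulled apart with a regular expression when scanning many log lines. A minimal sketch, assuming attribute values never contain a double quote; `parse_trace_event` is a hypothetical helper, not part of any FoundationDB tooling.

```python
import re

# Matches Key="value" pairs; values may contain spaces (e.g. Backtrace).
_ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_trace_event(line):
    """Return the attributes of one trace event line as a dict of strings."""
    return dict(_ATTR.findall(line))
```

For example, `parse_trace_event(line)["Type"]` gives the event type, and `int(parse_trace_event(line)["RecoveryCount"])` gives the recovery count for events that carry that attribute.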
Machine Roles count
10.121.131.29:4703 TL 448
10.121.134.92:4703 TL 439
10.121.154.20:4703 TL 435
10.121.155.32:4703 TL 423
10.77.1.4:4703 TL 448
10.77.140.4:4703 TL 438
10.77.150.15:4703 TL 456
10.77.151.74:4703 TL 443
10.79.49.20:4689 CD,TL 439
10.79.58.5:4689 TL 406
10.79.62.13:4689 CD,TL 419
10.79.62.13:4690 TL 445
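A per-machine tally like the table above could be produced by counting trace events per `Machine`/`Roles` pair. This is a minimal sketch: these notes do not say which event type the table counts, so the `event_type` argument is left to the caller, and `count_by_machine` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches Key="value" pairs in a trace event line.
_ATTR = re.compile(r'(\w+)="([^"]*)"')


def count_by_machine(lines, event_type):
    """Count events of event_type, grouped by (Machine, Roles)."""
    counts = Counter()
    for line in lines:
        attrs = dict(_ATTR.findall(line))
        if attrs.get("Type") == event_type:
            key = (attrs.get("Machine", ""), attrs.get("Roles", ""))
            counts[key] += 1
    return counts
```

Iterating over `sorted(count_by_machine(lines, event_type).items())` prints rows in the same machine, roles, count layout as the table.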