Jingyu Zhou edited this page Apr 23, 2022 · 2 revisions

OOM: caused by reading disk files (*.fdq are disk queue files, *.sqlite are for spilling)

Data points:

  • Total 2706 files, ~73GB in size
  • Many HugeArenaSample traces point to reading the disk queue (DiskQueue::readNext)
  • TLog creates one actor for each file and reads them in parallel
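
To re-check the file count and total size on another host, a minimal sketch follows. It assumes only that the disk queue and spill files can be recognized by the `.fdq` and `.sqlite` extensions mentioned above; the function name `summarize_tlog_files` and the directory argument are hypothetical, not part of fdbserver.

```python
from collections import defaultdict
from pathlib import Path


def summarize_tlog_files(data_dir):
    """Return {suffix: (file_count, total_bytes)} for *.fdq and *.sqlite files."""
    totals = defaultdict(lambda: [0, 0])
    for path in Path(data_dir).rglob("*"):
        # Only the two file kinds named in the notes are counted.
        if path.is_file() and path.suffix in (".fdq", ".sqlite"):
            totals[path.suffix][0] += 1
            totals[path.suffix][1] += path.stat().st_size
    return {suffix: (count, size) for suffix, (count, size) in totals.items()}
```

Calling `summarize_tlog_files("/path/to/data")` on the TLog data directory (the path is a placeholder) gives per-extension counts and byte totals that can be compared against the 2706 files and ~73GB noted above.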

No RecoveryDelayedTooManyOldGenerations event, so < 75 total recoveries.

Shared TLog should reuse the same file. Q: after reboot, is the same disk queue file used?

Q: Why so many queues and sqlite files?

Q: Why did so many recruitments happen? Does CC repeatedly recruit TLogs?

Q: Does each TLog recovery create new files? Yes. A Shared TLog uses one file, but there could be multiple files depending on configuration

Q: Is there any data corruption due to serialization that alters TLog data?

SQLite: 2GB cache. Disk queue: read all files. 2700 actors read all files

Approaches:

  • Test on another cluster 7.1.0rc3 -> 7.1.2 upgrade path
  • Copy the files out and use a customized fdbserver binary to inspect the content
  • If only one TLog has the issue, exclude that one

Severity="40" ErrorKind="Unset" Time="1650658233.310241" DateTime="2022-04-22T20:10:33Z" Type="OutOfMemory" ID="0000000000000000" Message="Out of memory" ThreadID="9529080063715037305" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x35f4abc 0x35f3740 0x35f3b3b 0x35be9fc 0x35bea2c 0x35d4cd5 0x11981f0 0x358f7c2 0xa420a9 0x7faf66d0e555" Machine="10.121.155.32:4703"

Event Severity="10" Time="1650658233.306716" DateTime="2022-04-22T20:10:33Z" Type="TLogPersistentStateRestore" ID="20a05e80a6d1232c" LogId="08c48f9c782166d8" Ver="182955804967008" RecoveryCount="5616" ThreadID="9529080063715037305" Machine="10.121.155.32:4703" LogGroup="playstation_prodfdbserver_p01" Roles="TL"
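
Trace lines like the two above are sets of `key="value"` attributes, so they can be pulled apart with a small regex. The sketch below is only an illustration: the helper names `parse_trace_attrs` and `event_time_utc` are hypothetical, and the sample line is an abbreviated copy of the TLogPersistentStateRestore event above.

```python
import re
from datetime import datetime, timezone

# Matches key="value" pairs; values may contain spaces (e.g. Backtrace).
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_trace_attrs(line):
    """Return the key="value" attributes of one trace line as a dict."""
    return dict(ATTR_RE.findall(line))


def event_time_utc(attrs):
    """Convert the Time attribute (epoch seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(float(attrs["Time"]), tz=timezone.utc)


# Abbreviated copy of the event shown above.
line = ('Severity="10" Time="1650658233.306716" DateTime="2022-04-22T20:10:33Z" '
        'Type="TLogPersistentStateRestore" RecoveryCount="5616" '
        'Machine="10.121.155.32:4703" Roles="TL"')
attrs = parse_trace_attrs(line)
print(attrs["Type"], attrs["Machine"], int(attrs["RecoveryCount"]))
print(event_time_utc(attrs).strftime("%Y-%m-%dT%H:%M:%SZ"))
```

The second print reproduces the DateTime attribute from the Time attribute, which is a quick sanity check that the two fields agree when correlating the OutOfMemory event with the restore event just before it.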

https://splunk-if.icloud.apple.com/en-US/app/search/search?sid=1650672821.64238_139BDE67-98A9-4FD7-A5EE-800DC055E245

Machine	Roles	Count
10.121.131.29:4703	TL	448
10.121.134.92:4703	TL	439
10.121.154.20:4703	TL	435
10.121.155.32:4703	TL	423
10.77.1.4:4703	TL	448
10.77.140.4:4703	TL	438
10.77.150.15:4703	TL	456
10.77.151.74:4703	TL	443
10.79.49.20:4689	CD,TL	439
10.79.58.5:4689	TL	406
10.79.62.13:4689	CD,TL	419
10.79.62.13:4690	TL	445
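
The notes do not say which event type the table counts. Assuming it is a count of one trace event type grouped by Machine and Roles, a similar breakdown could be rebuilt from raw trace lines as sketched below. The function name `count_by_machine` is hypothetical and the sample lines are made up for illustration; they are not real log data.

```python
import re
from collections import Counter

# Matches key="value" pairs in a trace line.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def count_by_machine(lines, event_type):
    """Count trace lines of one Type, keyed by (Machine, Roles)."""
    counts = Counter()
    for line in lines:
        attrs = dict(ATTR_RE.findall(line))
        if attrs.get("Type") == event_type:
            counts[(attrs.get("Machine", "?"), attrs.get("Roles", "?"))] += 1
    return counts


# Illustrative lines only (assumed event type, invented sample).
sample = [
    'Type="TLogPersistentStateRestore" Machine="10.121.155.32:4703" Roles="TL"',
    'Type="TLogPersistentStateRestore" Machine="10.121.155.32:4703" Roles="TL"',
    'Type="OutOfMemory" Machine="10.121.155.32:4703"',
    'Type="TLogPersistentStateRestore" Machine="10.79.49.20:4689" Roles="CD,TL"',
]
counts = count_by_machine(sample, "TLogPersistentStateRestore")
for (machine, roles), n in sorted(counts.items()):
    print(f"{machine}\t{roles}\t{n}")
```

Grouping on the pair (Machine, Roles) rather than Machine alone keeps processes with combined roles, such as CD,TL, visible as separate rows.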