SEGV in rocksdb during validator startup #4983

Open
alexpyattaev opened this issue Feb 14, 2025 · 4 comments
@alexpyattaev

alexpyattaev commented Feb 14, 2025

Problem

Caught a SEGV somewhere in the rocksdb arena allocator. It is unlikely to be an OOM, as the node had > 400 GB of free memory at the time of the crash.
Validator ID DmCowGH9DUHYCetfGaWzPzYCi455yDxewcycdyWuPLjx was reporting metrics to solana.metrics.com, and appears to have crashed immediately after fetching the snapshot.

Running on commit 0975a9f.

Backtrace below:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `agave-validator --dynamic-port-range 8002-8020 --gossip-port 8001 --identity /h'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005c342f87ff91 in rocksdb::ConcurrentArena::Repick() ()
[Current thread is 1 (Thread 0x7a2779e006c0 (LWP 4395))]
(gdb) bt
#0  0x00005c342f87ff91 in rocksdb::ConcurrentArena::Repick() ()
#1  0x00005c342f7a7e0c in char* rocksdb::ConcurrentArena::AllocateImpl<rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)::{lambda()#1}>(unsigned long, bool, rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)::{lambda()#1} const&) ()
#2  0x00005c342f7a78d6 in rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*) ()
#3  0x00005c342f8830cb in rocksdb::(anonymous namespace)::SkipListRep::Allocate(unsigned long, char**) ()
#4  0x00005c342f7a3338 in rocksdb::MemTable::Add(unsigned long, rocksdb::ValueType, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::ProtectionInfoKVOS<unsigned long> const*, bool, rocksdb::MemTablePostProcessInfo*, void**) ()
#5  0x00005c342f82a5f4 in rocksdb::(anonymous namespace)::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::ValueType, rocksdb::ProtectionInfoKVOS<unsigned long> const*) ()
#6  0x00005c342f827432 in rocksdb::(anonymous namespace)::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, rocksdb::Slice const&) ()
#7  0x00005c342f82046b in rocksdb::WriteBatchInternal::Iterate(rocksdb::WriteBatch const*, rocksdb::WriteBatch::Handler*, unsigned long, unsigned long) ()
#8  0x00005c342f826594 in rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::Writer*, unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, rocksdb::TrimHistoryScheduler*, bool, unsigned long, rocksdb::DB*, bool, bool, unsigned long, bool, bool) ()
#9  0x00005c342f749fba in rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*, rocksdb::PostMemTableCallback*) ()
#10 0x00005c342f7495c6 in rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*) ()
#11 0x00005c342f6a6934 in rocksdb_write ()
#12 0x00005c342e80d2d7 in rocksdb::db::DBCommon<T,rocksdb::db::DBWithThreadModeInner>::write ()
#13 0x00005c342e96c5af in solana_ledger::blockstore_db::Rocks::write ()
#14 0x00005c342eeab43c in solana_rpc::transaction_status_service::TransactionStatusService::write_transaction_status_batch ()
#15 0x00005c342ee2d17a in <rayon_core::job::HeapJob<BODY> as rayon_core::job::Job>::execute ()
#16 0x00005c342d5cf302 in rayon_core::registry::WorkerThread::wait_until_cold ()
#17 0x00005c342dcd1994 in rayon_core::registry::ThreadBuilder::run ()
#18 0x00005c342dcd748a in std::sys::backtrace::__rust_begin_short_backtrace ()
#19 0x00005c342dcd55a8 in core::ops::function::FnOnce::call_once{{vtable.shim}} ()
#20 0x00005c342f4df1bb in std::sys::pal::unix::thread::Thread::new::thread_start ()
#21 0x00007a28dfe9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#22 0x00007a28dff29c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
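
To see which validator code path issued the failing write, the backtrace above can be scanned for the first frame outside rocksdb. The snippet below is only a rough sketch: the helper name `first_frame_matching` and the `solana_` prefix filter are illustrative choices, not part of any agave or gdb tooling, and the regex assumes the frame layout printed above (`#N 0xADDR in SYMBOL ...`).

```python
import re

# Matches gdb frame lines such as "#13 0x00005c34... in some::symbol ()".
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (.+)$")


def first_frame_matching(backtrace: str, prefix: str):
    """Return (frame_number, symbol_text) for the first frame whose symbol
    text starts with `prefix`, or None if no frame matches."""
    for line in backtrace.splitlines():
        m = FRAME_RE.match(line.strip())
        if m and m.group(2).startswith(prefix):
            return int(m.group(1)), m.group(2)
    return None


# A few frames copied from the backtrace in this issue.
sample = """\
#11 0x00005c342f6a6934 in rocksdb_write ()
#12 0x00005c342e80d2d7 in rocksdb::db::DBCommon<T,rocksdb::db::DBWithThreadModeInner>::write ()
#13 0x00005c342e96c5af in solana_ledger::blockstore_db::Rocks::write ()
#14 0x00005c342eeab43c in solana_rpc::transaction_status_service::TransactionStatusService::write_transaction_status_batch ()
"""

if __name__ == "__main__":
    print(first_frame_matching(sample, "solana_"))
```

On the frames shown, this picks out frame 13 (`solana_ledger::blockstore_db::Rocks::write`), whose caller in frame 14 is the TransactionStatusService write path.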

@steviez

steviez commented Feb 14, 2025

> and appears to have crashed immediately after fetching the snapshot.

Given the line below from the backtrace, I think your validator succeeded in unpacking the snapshot, rebuilding the Bank, and getting through the rest of startup. TransactionStatusService receives work from ReplayStage when transactions are being replayed:

#14 0x00005c342eeab43c in solana_rpc::transaction_status_service::TransactionStatusService::write_transaction_status_batch ()

You can confirm this from either logs or metrics by looking for the validator-new datapoint. This datapoint is emitted right at the end of Validator::new().
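
As a rough way to do the log check, the sketch below scans a validator log for the first line mentioning the datapoint. It assumes datapoints appear in the log prefixed with `datapoint: <name>`; that format, and the helper names `find_datapoint` and `check_log`, are assumptions for illustration and should be checked against an actual log.

```python
def find_datapoint(lines, name="validator-new"):
    """Return (line_number, line) for the first log line that mentions the
    given datapoint, or None if it never appears."""
    needle = f"datapoint: {name}"  # assumed log format, verify on a real log
    for lineno, line in enumerate(lines, start=1):
        if needle in line:
            return lineno, line.rstrip("\n")
    return None


def check_log(path):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as f:
        return find_datapoint(f)
```

For example, `check_log("/home/sol/logs/agave-validator.log")` would use the `--log` path from the command line posted in this issue. A `None` result would suggest the process died before `Validator::new()` returned; a hit means startup completed and the crash happened later.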

One thing worth noting: given your commit, you were still running with the multi-threaded TransactionStatusService (TSS). We backed that change out in #4875 in order to do some refactoring, so the change you were running with is no longer present in master. However, we will likely try to reintroduce that functionality soon (CC @fkouteib).

Given that we will likely try to reintroduce that change (or at least something similar), I am curious about the root cause of this. In Discord, you mentioned you had a core dump; you could poke around in there, but I'm not really sure what I'd be looking for. Digging through the rocksdb source could be helpful too, but again, this seems pretty open-ended at the moment.
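
One possible starting point for the core dump, offered only as a sketch: dump the backtraces of every thread, not just the crashing one, to see whether other threads were inside rocksdb write paths at the same moment. The wrapper below builds a non-interactive gdb invocation using gdb's `--batch` and `-ex` options. The function names and any paths passed in are hypothetical, and nothing here is known to reveal the root cause.

```python
import subprocess


def gdb_batch_command(binary, core, commands):
    """Build an argv list that runs the given gdb commands against a core
    file and then exits, without opening an interactive session."""
    argv = ["gdb", "--batch"]
    for cmd in commands:
        argv += ["-ex", cmd]
    argv += [binary, core]
    return argv


def dump_all_threads(binary, core):
    """Return gdb's output for a thread listing plus every thread's backtrace."""
    argv = gdb_batch_command(
        binary, core, ["info threads", "thread apply all bt"]
    )
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout
```

Calling `dump_all_threads` with the path to the `agave-validator` binary that produced the dump and the path to the core file returns text that can be searched for other frames in `rocksdb::MemTable` or `TransactionStatusService`.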

@alexpyattaev
Author

The validator did indeed get through loading the ledger and then crashed.
The crash reproduced reliably on that particular commit.
Updating to the latest master fixed the issue.
So I guess your theory about the cause is correct. Just in case, I have the crash dump saved if you want to look around.

@steviez

steviez commented Feb 14, 2025

> The crash reproduced reliably on that particular commit.

Hmm, can you share what args your validator was running with? They might be of interest for TransactionStatusService take 2.

@alexpyattaev
Author

agave-validator \
  --dynamic-port-range 8002-8020 \
  --gossip-port 8001 \
  --identity /home/sol/identity/id.json \
  --ledger /home/sol/ledger \
  --snapshots /home/sol/ledger \
  --limit-ledger-size \
  --log /home/sol/logs/agave-validator.log \
  --rpc-port 8899 \
  --wal-recovery-mode skip_any_corrupted_record \
  --no-voting \
  --trusted-validator 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 \
  --trusted-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
  --trusted-validator DE1bawNcRJB9rVm3buyMVfr8mBEoyyu73NBovf2oXJsJ \
  --trusted-validator CakcnaRDHka2gXyfbEd2d3xsvkJkqsLw2akB3zsN1D2S \
  --no-untrusted-rpc \
  --expected-genesis-hash 5eykt4UsFv8P8NJdTREpY1vzqKqZKvdpKuc147dw2N9d \
  --expected-shred-version 50093 \
  --entrypoint entrypoint.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint3.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint4.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint5.mainnet-beta.solana.com:8001 \
  --no-genesis-fetch \
  --no-snapshot-fetch \
  --enable-extended-tx-metadata-storage \
  --enable-rpc-transaction-history \
  --full-rpc-api
