Node Issue: Testnet node getting stuck at block sync after restart #12976

Open
evgenykuzyakov opened this issue Feb 22, 2025 · 3 comments
Labels
community Issues created by community investigation required Node Node team

@evgenykuzyakov
Collaborator

evgenykuzyakov commented Feb 22, 2025

Contact Details

[email protected]

Node type

RPC

Which network are you running?

testnet

What happened?

Four testnet nodes running 2.5.0-rc.1 were affected. Note that they may not all be hitting the same issue.

The nodes fail to complete block sync and get stuck at certain blocks.

The main issue is that the snapshot-producing node failed to sync up after the last snapshot was uploaded.
The node fails to sync with the following error message:

Feb 22 13:30:44 node-ft01 nearcore[263250]: 2025-02-22T13:30:44.513565Z  WARN chain: Error in applying chunk for block shard_id=9 hash=J4yP1KAL57RvCnCbsyjn1jtmj636cxfEePzooxkUmVTs err=Storage Error: MissingTrieValue(TrieStorage, XMuesBVj3SqcHXSrWax4Ca6TjKivHpMgUwXdjcwPGJU)
Feb 22 13:30:44 node-ft01 nearcore[263250]: 2025-02-22T13:30:44.513630Z ERROR client: try_process_unfinished_blocks got errors errors={J4yP1KAL57RvCnCbsyjn1jtmj636cxfEePzooxkUmVTs: StorageError(MissingTrieValue(TrieStorage, XMuesBVj3SqcHXSrWax4Ca6TjKivHpMgUwXdjcwPGJU))}

Reproduce steps

The state is easy to reproduce using the snapshot from block 188313698 (details: https://docs.fastnear.com/docs/snapshots#rpc-testnet-snapshot):

# Latest rclone
sudo -v ; curl https://rclone.org/install.sh | sudo bash
# Will download the snapshot into the `~/.near/data`
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/fastnear/static/refs/heads/main/down_rclone.sh | DATA_PATH=~/.near/data CHAIN_ID=testnet BLOCK=188313698 bash

Version

neard (release 2.5.0-rc.3) (build 2.5.0-rc.3) (rustc 1.84.0) (protocol 76) (db 43)
features: [default, json_rpc, rosetta_rpc]

Relevant log output

Feb 22 13:30:44 node-ft01 nearcore[263250]: 2025-02-22T13:30:44.513565Z  WARN chain: Error in applying chunk for block shard_id=9 hash=J4yP1KAL57RvCnCbsyjn1jtmj636cxfEePzooxkUmVTs err=Storage Error: MissingTrieValue(TrieStorage, XMuesBVj3SqcHXSrWax4Ca6TjKivHpMgUwXdjcwPGJU)
Feb 22 13:30:44 node-ft01 nearcore[263250]: 2025-02-22T13:30:44.513630Z ERROR client: try_process_unfinished_blocks got errors errors={J4yP1KAL57RvCnCbsyjn1jtmj636cxfEePzooxkUmVTs: StorageError(MissingTrieValue(TrieStorage, XMuesBVj3SqcHXSrWax4Ca6TjKivHpMgUwXdjcwPGJU))}
Feb 22 13:30:45 node-ft01 nearcore[263250]: 2025-02-22T13:30:45.011613Z  INFO stats: #188313698 Downloading blocks 0.00% (44 left; at 188313698) 42 peers ⬇ 504 kB/s ⬆ 666 kB/s 0.00 bps 0 gas/s CPU: 197%, Mem: 5.62 GB
Feb 22 13:30:45 node-ft01 nearcore[263250]: 2025-02-22T13:30:45.684549Z ERROR network: Failed to store connection attempt. peer_info=PeerInfo { id: ed25519:DeRyxMeaSfDC6MeNFMDmHS4tshnYN8VMyCqoybqbUV4g, addr: Some(34.29.37.230:24567), account_id: None }
Feb 22 13:30:46 node-ft01 nearcore[263250]: 2025-02-22T13:30:46.495872Z  WARN chain: Error in applying chunk for block shard_id=9 hash=J4yP1KAL57RvCnCbsyjn1jtmj636cxfEePzooxkUmVTs err=Storage Error: MissingTrieValue(TrieStorage, XMuesBVj3SqcHXSrWax4Ca6TjKivHpMgUwXdjcwPGJU)
Feb 22 13:30:46 node-ft01 nearcore[263250]: 2025-02-22T13:30:46.495954Z ERROR client: try_process_unfinished_blocks got errors errors={J4yP1KAL57RvCnCbsyjn1jtmj636cxfEePzooxkUmVTs: StorageError(MissingTrieValue(TrieStorage, XMuesBVj3SqcHXSrWax4Ca6TjKivHpMgUwXdjcwPGJU))}
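Because the same warning repeats on every retry, it can help to condense the log into one line per distinct shard, block hash, and missing trie key. The following is a minimal sketch, not part of the original report: the function name `summarize_missing_trie_values` is hypothetical, and the pattern assumes the `WARN chain: Error in applying chunk` line format shown above.

```shell
#!/usr/bin/env bash
# Summarize MissingTrieValue warnings from neard logs read on stdin.
# Prints one line per distinct (shard_id, block hash, missing trie key),
# prefixed with the number of times that combination occurred.
# ASSUMPTION: log lines follow the "Error in applying chunk" format quoted above.
summarize_missing_trie_values() {
  grep 'Error in applying chunk' \
    | sed -n 's/.*shard_id=\([0-9]*\) hash=\([^ ]*\) err=.*MissingTrieValue([^,]*, \([^)]*\)).*/shard=\1 block=\2 missing=\3/p' \
    | sort | uniq -c | sort -rn
}

# Example usage (the syslog identifier "nearcore" is taken from the log prefix
# above; adjust it to match how the node is run on your machine):
#   journalctl -t nearcore --since "1 hour ago" | summarize_missing_trie_values
```

If every line reports the same shard and trie key, as in the excerpt above, the node is retrying a single block rather than failing on many different ones.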

Node head info

Node upgrade history

Was running 2.5.0-rc.1 before the protocol upgrade.

DB reset history

The node was producing snapshots for FastNear testnet.
@VanBarbascu
Contributor

Thanks for reporting this! The team is aware of the issue and we will come back tomorrow with the mitigation steps.

@VanBarbascu
Contributor

The team narrowed the issue down to garbage collection on the newly created shard (9). This problem only occurs on RPC nodes.

We are working on a fix so that it does not happen in future reshardings.

If you are affected by this, you can mitigate the issue by obtaining a good state, either via Epoch Sync or by restoring the node from one of the latest FastNEAR snapshots.

Thanks @evgenykuzyakov for being proactive in facilitating the recovery!

Epoch Sync:

./neard init --download-config rpc --chain-id testnet --download-genesis

# Fetch the currently active peers and format them as a comma-separated
# list of <id>@<addr> entries.
curl -X POST https://rpc.testnet.near.org \
  -H "Content-Type: application/json" \
  -d '{
        "jsonrpc": "2.0",
        "method": "network_info",
        "params": [],
        "id": "dontcare"
      }' \
  | jq -r '.result.active_peers[] as $active_peer | "\($active_peer.id)@\($active_peer.addr)"' \
  | paste -sd',' -
# Put the resulting list into the node's config before starting the node.

./neard run
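
The "put the output in configs" step above can be scripted. This is a rough sketch under one assumption that is not stated in the thread: that the peer list belongs in the `network.boot_nodes` field of the node's `config.json`. The helper name `set_boot_nodes` is hypothetical.

```shell
#!/usr/bin/env bash
# Write a comma-separated boot node list into a neard config file.
# ASSUMPTION: the list goes into the `network.boot_nodes` field of config.json;
# the thread itself only says "put the output in configs".
set_boot_nodes() {
  local config_path="$1" boot_nodes="$2"
  local tmp
  tmp="$(mktemp)" || return 1
  # --arg passes the list as a JSON string, so no manual quoting is needed.
  if jq --arg nodes "$boot_nodes" '.network.boot_nodes = $nodes' "$config_path" > "$tmp"; then
    # Overwrite in place so the original file's permissions are kept.
    cat "$tmp" > "$config_path"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Example usage (the path assumes the default ~/.near home directory):
#   set_boot_nodes ~/.near/config.json "<id>@<addr>,<id>@<addr>"
```

Writing to a temporary file first means a failed `jq` run leaves the existing config untouched.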

Snapshot instructions can be found here.

@evgenykuzyakov
Collaborator Author

evgenykuzyakov commented Feb 23, 2025

An archival node was affected as well. I am not sure how to recover it yet. I assume I should be able to download older hot snapshots from a regular RPC node and then sync properly once the GC issue is fixed.

EDIT: Solved that using the latest RPC snapshot. I converted it to hot and advanced it a bit so that its HEAD was later than the previous cold-data HEAD. Then it was able to sync up. Will upload a new archive snapshot now.
