Sudden gaps in turbine load #4965

Open
alexpyattaev opened this issue Feb 13, 2025 · 6 comments
Comments

@alexpyattaev

alexpyattaev commented Feb 13, 2025

Problem

The load on the turbine retransmit stage tends to develop 1-2 minute long "gaps" like this:
[screenshots: retransmit load graphs showing the gaps]
This happens on both testnet and mainnet.

  • If a given validator is not participating in forwarding shreds for more than a minute, that is probably not just random chance.
  • The load does not drop to zero (i.e. it is likely not a network issue).
  • CPU load on the validator was not particularly high at the time.
@t-nelson

Duration smells like vote lockout, though a vote lockout shouldn't really influence retransmit. Check logs around that time for something like "waiting to switch forks".

@alexpyattaev

rg "Waiting for switch fork to make block" logs/solana-validator.log
turns out empty.

@bw-solana

Cluster is partitioning --> stop voting --> blocks get small --> retransmit shred count drops
[screenshot: cluster metrics during the partition]

Need to understand what is causing the partitioning and why it takes so long to come back together.
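To make the last link of that chain concrete, here is a minimal sketch (not agave code; the shred payload size and slot rate are rough assumptions) of why the retransmit counter tracks block size: the near-empty blocks produced while voting is stalled simply contain far fewer shreds to fan out.

```rust
// Illustration only: shows that retransmit load is roughly proportional to
// shreds per slot, so small blocks => a low retransmit counter even on a
// perfectly healthy node. Constants are assumptions, not agave's values.
const SHRED_PAYLOAD_BYTES: usize = 1_000; // approximate data-shred payload
const SLOTS_PER_SEC: f64 = 2.5; // ~400 ms slots

fn data_shreds_per_slot(block_bytes: usize) -> usize {
    // coding shreds roughly double this in practice
    (block_bytes + SHRED_PAYLOAD_BYTES - 1) / SHRED_PAYLOAD_BYTES
}

fn retransmit_rate(block_bytes: usize) -> f64 {
    data_shreds_per_slot(block_bytes) as f64 * SLOTS_PER_SEC
}

fn main() {
    // A normal block vs. a near-empty block produced during the partition.
    for (label, bytes) in [("normal block", 2_000_000), ("partition block", 50_000)] {
        println!("{label}: ~{:.0} shreds/s to retransmit", retransmit_rate(bytes));
    }
}
```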

@bw-solana

[screenshot: refresh-vote and partition-resolution metrics]

Looks like we're stuck until refresh. Notice at the end of every bathtub, there is a spike in refresh votes and long partition resolution count.

Reducing refresh interval should fix this. @AshwinSekar is cooking
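For context on why the refresh interval bounds the gap length: a vote that never lands gets re-sent ("refreshed") after a timeout, and during a partition only those refreshed votes seem to land, so the timeout effectively sets how long the bathtub lasts. A minimal sketch, with hypothetical names and an assumed interval (not agave's actual constant):

```rust
use std::time::{Duration, Instant};

// Hypothetical reduced refresh interval; agave's real constant differs.
const VOTE_REFRESH_INTERVAL: Duration = Duration::from_secs(1);

struct VoteStatus {
    last_vote_sent: Instant,
    last_vote_landed: bool,
}

impl VoteStatus {
    // Re-send the latest vote if it never landed and the interval has elapsed.
    fn maybe_refresh(&mut self, now: Instant, send_vote: impl Fn()) {
        if !self.last_vote_landed
            && now.duration_since(self.last_vote_sent) >= VOTE_REFRESH_INTERVAL
        {
            send_vote();
            self.last_vote_sent = now;
        }
    }
}

fn main() {
    let mut status = VoteStatus { last_vote_sent: Instant::now(), last_vote_landed: false };
    std::thread::sleep(Duration::from_millis(1_100));
    status.maybe_refresh(Instant::now(), || println!("refreshing un-landed vote"));
}
```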

@alexpyattaev

Just checked how much latency we are adding per shred during:

  • broadcast: 20-90 ms for the entire block depending on block size
  • retransmit: 500 us on every hop

So it would seem that we are moving shreds about as fast as we realistically can, or at least close to it.
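A rough sketch of how that per-shred latency can be sampled (the Shred struct here is a stand-in, not agave's type or its actual instrumentation): stamp each shred when it is received and measure the elapsed time just before it is re-sent.

```rust
use std::time::{Duration, Instant};

// Stand-in shred: carries its receive timestamp alongside the payload.
struct Shred {
    received_at: Instant,
    payload: Vec<u8>,
}

// Retransmit a batch and return the average time each shred spent in this node.
fn retransmit(shreds: &[Shred], send: impl Fn(&[u8])) -> Duration {
    let mut total = Duration::ZERO;
    for shred in shreds {
        // latency added by this hop: receipt -> retransmit
        total += shred.received_at.elapsed();
        send(&shred.payload);
    }
    total / shreds.len().max(1) as u32
}

fn main() {
    let shreds = vec![Shred { received_at: Instant::now(), payload: vec![0u8; 1_000] }];
    std::thread::sleep(Duration::from_micros(500));
    let avg = retransmit(&shreds, |_bytes| { /* UDP send elided */ });
    println!("average added latency per shred: {avg:?}");
}
```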

@AshwinSekar

> Reducing refresh interval should fix this. @AshwinSekar is cooking

The refresh interval reduction landed in v2.2. We should still diagnose why the votes fail to propagate in the first place:

  • Leader selection is busted when there is more than one fork. We only target the next leader when sending our vote, and if they are on the other fork our vote is thrown out (see the sketch after this comment).
  • Gossip should fix this, but our vote is getting lost somewhere: either it never gets "out the door" or it is dropped during the leader's ingestion. I have no reason to assume gossip is busted, so I would bet something is wrong in leader ingestion; however, the fact that refreshed votes land indicates otherwise. 🤔

I'm in favor of backporting the reduction to v2.1 (#4608), but we should make sure to solve this properly.
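A minimal sketch of the "target more than one upcoming leader" idea from the first bullet above, assuming a hypothetical leader_schedule slice and fanout constant (neither is an agave API): fanning the vote out to the next few distinct leaders means a single leader on the wrong fork can no longer drop it.

```rust
use std::collections::HashSet;

// Hypothetical fanout: how many distinct upcoming leaders receive the vote.
const FANOUT_LEADERS: usize = 3;

// Pick the next few distinct leaders after `current_slot` from a (simplified)
// per-slot leader schedule.
fn vote_targets(leader_schedule: &[&'static str], current_slot: usize) -> Vec<&'static str> {
    let mut seen = HashSet::new();
    leader_schedule
        .iter()
        .cycle()
        .skip(current_slot + 1) // start at the next leader slot
        .take(leader_schedule.len()) // at most one full rotation
        .filter(|id| seen.insert(**id)) // distinct leaders only
        .take(FANOUT_LEADERS)
        .copied()
        .collect()
}

fn main() {
    // Four leaders, two slots each.
    let schedule = ["A", "A", "B", "B", "C", "C", "D", "D"];
    // A vote at slot 1 goes to B, C and D instead of only B.
    println!("{:?}", vote_targets(&schedule, 1));
}
```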
