Sudden gaps in turbine load #4965

Open
alexpyattaev opened this issue Feb 13, 2025 · 6 comments
Comments

@alexpyattaev

alexpyattaev commented Feb 13, 2025

Problem

The load on the turbine retransmit stage tends to develop 1-2 minute long "gaps" like this:
[screenshots: retransmit load graphs showing the gaps]
This happens on both testnet and mainnet.

  • If a given validator is not participating in forwarding shreds for more than a minute, that is probably not just random chance.
  • The load does not drop to zero (i.e. it is likely not a network issue).
  • CPU load on the validator was not particularly high at the time.
@t-nelson

Duration smells like vote lockout, though a vote lockout shouldn't really influence retransmit. Check logs around that time for something like "waiting to switch forks".

@alexpyattaev

rg "Waiting for switch fork to make block" logs/solana-validator.log
turns out empty.

@bw-solana

Cluster is partitioning --> stop voting --> blocks get small --> retransmit shred count drops
[screenshot: cluster metrics during the partition]

Need to understand what is causing the partitioning and why it takes so long to come back together.
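To make the last link of that chain concrete, here is a minimal sketch (not agave code; the shred payload size and slot rate are rough assumptions) of why the retransmit counter tracks block size: the near-empty blocks produced while voting is stalled simply contain far fewer shreds to fan out.

```rust
// Illustration only: shows that retransmit load is roughly proportional to
// shreds per slot, so small blocks => a low retransmit counter even on a
// perfectly healthy node. Constants are assumptions, not agave's values.
const SHRED_PAYLOAD_BYTES: usize = 1_000; // approximate data-shred payload
const SLOTS_PER_SEC: f64 = 2.5; // ~400 ms slots

fn data_shreds_per_slot(block_bytes: usize) -> usize {
    // coding shreds roughly double this in practice
    (block_bytes + SHRED_PAYLOAD_BYTES - 1) / SHRED_PAYLOAD_BYTES
}

fn retransmit_rate(block_bytes: usize) -> f64 {
    data_shreds_per_slot(block_bytes) as f64 * SLOTS_PER_SEC
}

fn main() {
    // A normal block vs. a near-empty block produced during the partition.
    for (label, bytes) in [("normal block", 2_000_000), ("partition block", 50_000)] {
        println!("{label}: ~{:.0} shreds/s to retransmit", retransmit_rate(bytes));
    }
}
```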

@bw-solana

[screenshot: refresh-vote and partition-resolution metrics]

Looks like we're stuck until refresh. Notice at the end of every bathtub, there is a spike in refresh votes and long partition resolution count.

Reducing refresh interval should fix this. @AshwinSekar is cooking
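For context on why the refresh interval bounds the gap length: a vote that never lands gets re-sent ("refreshed") after a timeout, and during a partition only those refreshed votes seem to land, so the timeout effectively sets how long the bathtub lasts. A minimal sketch, with hypothetical names and an assumed interval (not agave's actual constant):

```rust
use std::time::{Duration, Instant};

// Hypothetical reduced refresh interval; agave's real constant differs.
const VOTE_REFRESH_INTERVAL: Duration = Duration::from_secs(1);

struct VoteStatus {
    last_vote_sent: Instant,
    last_vote_landed: bool,
}

impl VoteStatus {
    // Re-send the latest vote if it never landed and the interval has elapsed.
    fn maybe_refresh(&mut self, now: Instant, send_vote: impl Fn()) {
        if !self.last_vote_landed
            && now.duration_since(self.last_vote_sent) >= VOTE_REFRESH_INTERVAL
        {
            send_vote();
            self.last_vote_sent = now;
        }
    }
}

fn main() {
    let mut status = VoteStatus { last_vote_sent: Instant::now(), last_vote_landed: false };
    std::thread::sleep(Duration::from_millis(1_100));
    status.maybe_refresh(Instant::now(), || println!("refreshing un-landed vote"));
}
```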

@alexpyattaev

Just checked how much latency we are adding per shred during:

  • broadcast: 20-90 ms for the entire block depending on block size
  • retransmit: 500 us on every hop

So it would seem that we are moving shreds about as fast as we realistically can, or at least close to it.
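A rough sketch of how that per-shred latency can be sampled (the Shred struct here is a stand-in, not agave's type or its actual instrumentation): stamp each shred when it is received and measure the elapsed time just before it is re-sent.

```rust
use std::time::{Duration, Instant};

// Stand-in shred: carries its receive timestamp alongside the payload.
struct Shred {
    received_at: Instant,
    payload: Vec<u8>,
}

// Retransmit a batch and return the average time each shred spent in this node.
fn retransmit(shreds: &[Shred], send: impl Fn(&[u8])) -> Duration {
    let mut total = Duration::ZERO;
    for shred in shreds {
        // latency added by this hop: receipt -> retransmit
        total += shred.received_at.elapsed();
        send(&shred.payload);
    }
    total / shreds.len().max(1) as u32
}

fn main() {
    let shreds = vec![Shred { received_at: Instant::now(), payload: vec![0u8; 1_000] }];
    std::thread::sleep(Duration::from_micros(500));
    let avg = retransmit(&shreds, |_bytes| { /* UDP send elided */ });
    println!("average added latency per shred: {avg:?}");
}
```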

@AshwinSekar

> Reducing refresh interval should fix this. @AshwinSekar is cooking

The refresh interval reduction landed in v2.2. We should still diagnose why the votes fail to propagate in the first place:

  • Leader selection is busted when there is more than one fork. We only target the next leader when sending our vote, and if they are on the other fork our vote is thrown out (see the sketch after this comment).
  • Gossip should fix this, but our vote is getting lost somewhere: either it never gets "out the door" or it is dropped during the leader's ingestion. I have no reason to assume gossip is busted, so I would bet something is wrong in leader ingestion; however, the fact that refreshed votes land indicates otherwise. 🤔

I'm in favor of backporting the reduction to v2.1 (#4608), but we should make sure to solve this properly.
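A minimal sketch of the "target more than one upcoming leader" idea from the first bullet above, assuming a hypothetical leader_schedule slice and fanout constant (neither is an agave API): fanning the vote out to the next few distinct leaders means a single leader on the wrong fork can no longer drop it.

```rust
use std::collections::HashSet;

// Hypothetical fanout: how many distinct upcoming leaders receive the vote.
const FANOUT_LEADERS: usize = 3;

// Pick the next few distinct leaders after `current_slot` from a (simplified)
// per-slot leader schedule.
fn vote_targets(leader_schedule: &[&'static str], current_slot: usize) -> Vec<&'static str> {
    let mut seen = HashSet::new();
    leader_schedule
        .iter()
        .cycle()
        .skip(current_slot + 1) // start at the next leader slot
        .take(leader_schedule.len()) // at most one full rotation
        .filter(|id| seen.insert(**id)) // distinct leaders only
        .take(FANOUT_LEADERS)
        .copied()
        .collect()
}

fn main() {
    // Four leaders, two slots each.
    let schedule = ["A", "A", "B", "B", "C", "C", "D", "D"];
    // A vote at slot 1 goes to B, C and D instead of only B.
    println!("{:?}", vote_targets(&schedule, 1));
}
```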
