RPC method getHealth is incorrectly returning healthy "ok" response #5071

Open
SVS-bigj opened this issue Feb 25, 2025 · 2 comments

@SVS-bigj

Problem

I have noticed that sometimes after one of our RPC nodes crashes, specifically after panicking with the error described in issue #5070, the node will intermittently return the healthy "ok" response for the getHealth RPC method once it starts catching back up. This happens even though it is still not caught up and is often 2500+ slots behind. The screenshot below shows both the solana catchup command and the getHealth RPC method being run against the affected RPC node.

Image

However, I have found that after some time it returns the correct response, indicating that it is not caught up yet and is thus "unhealthy". I have also provided a graph with labels depicting the cycle. The getHealth method is checked every 2 seconds across all our RPC nodes and is used for load balancing.

Image

In the above image, 1 indicates that the health check received a healthy response from the RPC node, and 0 indicates that the node is either offline or returned anything other than an "ok" response. After some time (5-10 minutes), the node realizes it is not healthy and returns the correct response until it has fully caught back up.
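As a possible stopgap on the load balancer side, the health check could refuse to trust "ok" on its own and also compare the node's slot against a reference slot. The sketch below is only an illustration under assumptions: the function names are hypothetical, the slots are assumed to come from the getSlot RPC method on the affected node and on some trusted reference node, and the threshold of 25 simply mirrors the --health-check-slot-distance value in the startup arguments below. It is not how agave computes health internally.

```python
from typing import Optional

# Assumption: mirrors --health-check-slot-distance 25 from the startup arguments.
MAX_SLOT_DISTANCE = 25


def reports_ok(health_response: dict) -> bool:
    """Return True when a getHealth JSON-RPC response carries the "ok" result."""
    return health_response.get("result") == "ok"


def is_routable(
    health_response: dict,
    node_slot: Optional[int],
    reference_slot: Optional[int],
    max_distance: int = MAX_SLOT_DISTANCE,
) -> bool:
    """Hypothetical load-balancer check: require "ok" AND a small slot lag.

    node_slot and reference_slot are assumed to be getSlot results from the
    node under test and from a trusted reference node, respectively.
    """
    if not reports_ok(health_response):
        return False
    if node_slot is None or reference_slot is None:
        # Lag cannot be verified without both slots, so fail closed.
        return False
    return reference_slot - node_slot <= max_distance
```

With a check like this, a node that answers "ok" while 2500 slots behind the reference would still be taken out of rotation, at the cost of one extra getSlot call per node per poll.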

Version Information:

  • v2.1.11-jito
  • Yellowstone gRPC Geyser plugin v5.0.0+solana.2.1.11

Startup Arguments:

agave-validator \
  --ledger /var/solana/data/ledger \
  --accounts /var/solana/accounts \
  --identity /var/solana/data/config/validator-keypair.json \
  --known-validator 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 \
  --known-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
  --known-validator DE1bawNcRJB9rVm3buyMVfr8mBEoyyu73NBovf2oXJsJ \
  --known-validator HyperSPG8w4jgdHgmA8ExrhRL1L1BriRTHD9UFdXJUud \
  --known-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
  --expected-genesis-hash 5eykt4UsFv8P8NJdTREpY1vzqKqZKvdpKuc147dw2N9d \
  --entrypoint entrypoint.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint3.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint4.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint5.mainnet-beta.solana.com:8001 \
  --no-voting \
  --only-known-rpc \
  --log /home/solana/validator.log \
  --rpc-port 8899 \
  --dynamic-port-range 8000-8100 \
  --init-complete-file /var/solana/data/init-completed \
  --limit-ledger-size  100000000 \
  --wal-recovery-mode skip_any_corrupted_record \
  --full-rpc-api \
  --enable-rpc-transaction-history \
  --enable-cpi-and-log-storage \
  --account-index program-id \
  --account-index spl-token-owner \
  --account-index spl-token-mint \
  --rpc-bind-address 10.10.5.2 \
  --rpc-send-leader-count 2 \
  --private-rpc \
  --rpc-threads 48 \
  --geyser-plugin-config /home/solana/bin/yellowstone-grpc-config.json \
  --minimal-snapshot-download-speed 50485760 \
  --rpc-send-service-max-retries 10 \
  --block-verification-method unified-scheduler \
  --unified-scheduler-handler-threads 8 \
  --health-check-slot-distance 25

Proposed Solution

I do not have a solution for this. I will try to look into it some more if no one else has any ideas.

@steviez

steviez commented Feb 26, 2025

I have not yet had the chance to review, but this sounds like it might be the same issue as discussed here: #5042

@SVS-bigj

Thanks for pointing that out. Yeah, it seems very similar to the problem I am experiencing. I may try to pull the proposed changes and test them out if it ends up being a while before they get merged into a release.
