RPC method getHealth is incorrectly returning healthy "ok" response #5071

SVS-bigj · 2025-02-25T21:57:13Z

Problem

I have noticed that sometimes, after one of our RPC nodes crashes (specifically after panicking with the error described in #5070), the node intermittently returns the healthy "ok" response for the getHealth RPC method while it is catching back up. This happens even though the node is not yet caught up and is often 2500+ slots behind. I have provided a screenshot below of running both the solana catchup command and the getHealth RPC method against the affected RPC node.
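The mismatch described above can also be detected programmatically. The following is a minimal sketch, not the validator's own health logic: it assumes the affected node and a trusted reference node both expose the standard getHealth and getSlot JSON-RPC methods over HTTP, and that an unhealthy getHealth answer arrives as a JSON-RPC error object rather than a transport failure. The function names, the URLs passed to `check_node`, and the reuse of 25 as the allowed distance (borrowed from the `--health-check-slot-distance` value in the startup arguments) are illustrative assumptions.

```python
import json
import urllib.request


def rpc_call(url, method, params=None, timeout=2.0):
    """POST a JSON-RPC 2.0 request and return the decoded response body."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    ).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def reports_healthy(health_response):
    """True only when getHealth answered with the result "ok"."""
    return health_response.get("result") == "ok"


def slots_behind(node_slot, reference_slot):
    """How far the node trails the reference node; never negative."""
    return max(0, reference_slot - node_slot)


def is_false_positive(health_response, node_slot, reference_slot, max_distance=25):
    """True when the node says "ok" but trails the reference by more than max_distance."""
    lag = slots_behind(node_slot, reference_slot)
    return reports_healthy(health_response) and lag > max_distance


def check_node(node_url, reference_url, max_distance=25):
    """Query both nodes (hypothetical URLs) and flag a misleading "ok" response."""
    health = rpc_call(node_url, "getHealth")
    node_slot = rpc_call(node_url, "getSlot")["result"]
    reference_slot = rpc_call(reference_url, "getSlot")["result"]
    return is_false_positive(health, node_slot, reference_slot, max_distance)
```

Comparing getSlot values between two nodes is only an approximation of what solana catchup reports, since the two requests are not taken at the same instant, but it is enough to tell a node that is a few slots behind from one that is thousands behind.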

However, I have found that after some time the node returns the correct response, indicating that it is not caught up yet and is therefore "unhealthy". Additionally, I have provided a graph with labels depicting the cycle. We check the getHealth method every 2 seconds across all our RPC nodes and use the result for load balancing.

In the above image, 1 indicates that a healthy response was received from the RPC node, and 0 indicates that the node is either offline or returned anything other than an "ok" response to the health check. After some time (5-10 minutes), the node recognizes that it is not healthy and returns the correct response until it has fully caught up.
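For reference, the 0/1 encoding used in the graph can be sketched roughly as follows. This is an illustration of the mapping described above, not our actual monitoring code: `health_metric` and `healthy_runs` are hypothetical names, a missing response (node offline) is represented as `None`, and the 2-second interval matches our polling period.

```python
def health_metric(response):
    """Return 1 for an "ok" getHealth result, 0 for anything else or no response."""
    if not isinstance(response, dict):
        return 0
    return 1 if response.get("result") == "ok" else 0


def healthy_runs(samples, interval_seconds=2):
    """Collapse consecutive 0/1 samples into (value, duration_seconds) runs."""
    runs = []
    for value in samples:
        if runs and runs[-1][0] == value:
            # Same value as the previous sample: extend the current run.
            runs[-1][1] += interval_seconds
        else:
            # Value changed: start a new run.
            runs.append([value, interval_seconds])
    return [tuple(run) for run in runs]
```

Grouping the samples into runs makes the cycle in the graph easier to quantify: a run of 1s lasting several minutes while the node is known to be behind corresponds to the incorrect "ok" window.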

Version Information:

v2.1.11-jito
Yellowstone gRPC Geyser plugin v5.0.0+solana.2.1.11

Startup Arguments:

agave-validator \
  --ledger /var/solana/data/ledger \
  --accounts /var/solana/accounts \
  --identity /var/solana/data/config/validator-keypair.json \
  --known-validator 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 \
  --known-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
  --known-validator DE1bawNcRJB9rVm3buyMVfr8mBEoyyu73NBovf2oXJsJ \
  --known-validator HyperSPG8w4jgdHgmA8ExrhRL1L1BriRTHD9UFdXJUud \
  --known-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
  --expected-genesis-hash 5eykt4UsFv8P8NJdTREpY1vzqKqZKvdpKuc147dw2N9d \
  --entrypoint entrypoint.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint3.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint4.mainnet-beta.solana.com:8001 \
  --entrypoint entrypoint5.mainnet-beta.solana.com:8001 \
  --no-voting \
  --only-known-rpc \
  --log /home/solana/validator.log \
  --rpc-port 8899 \
  --dynamic-port-range 8000-8100 \
  --init-complete-file /var/solana/data/init-completed \
  --limit-ledger-size 100000000 \
  --wal-recovery-mode skip_any_corrupted_record \
  --full-rpc-api \
  --enable-rpc-transaction-history \
  --enable-cpi-and-log-storage \
  --account-index program-id \
  --account-index spl-token-owner \
  --account-index spl-token-mint \
  --rpc-bind-address 10.10.5.2 \
  --rpc-send-leader-count 2 \
  --private-rpc \
  --rpc-threads 48 \
  --geyser-plugin-config /home/solana/bin/yellowstone-grpc-config.json \
  --minimal-snapshot-download-speed 50485760 \
  --rpc-send-service-max-retries 10 \
  --block-verification-method unified-scheduler \
  --unified-scheduler-handler-threads 8 \
  --health-check-slot-distance 25

Proposed Solution

I do not have a solution to this. I will try and look into this some more if no one else has any ideas.

steviez · 2025-02-26T04:15:57Z

I have not yet had the chance to review, but this sounds like it might be the same issue as discussed here: #5042

SVS-bigj · 2025-02-27T22:48:15Z

Thanks for pointing that out. Yeah, it seems very similar to the problem I am experiencing. I may pull the proposed changes and test them myself if it ends up being a while before they are merged into a release.

SVS-bigj added the community label Feb 25, 2025
