Thanks for pointing that out. Yeah it seems very similar to the problem I am experiencing. I may try to pull the proposed changes and test it out if it ends up being a while before it gets merged into a release.
Problem
I have noticed that sometimes after one of our RPC nodes crashes, specifically after panicking from the error described in issue #5070, the node intermittently returns the healthy "ok" response for the `getHealth` RPC method once it starts catching back up. This happens even though it is still not caught up, often 2500+ slots behind. I have provided a screenshot below of running both the `solana catchup` command and the `getHealth` RPC method against the affected RPC node.

However, I have found that after some time it returns the correct response, indicating that it is not caught up yet and is thus "unhealthy". Additionally, I have provided a graph with labels depicting the cycle. The `getHealth` method is checked every 2 seconds across all our RPC nodes and is used for load balancing.

In the above image, 1 indicates that a healthy response was received from the RPC node, and 0 indicates that the node is either offline or returned anything other than an "ok" response to the health check. After some time (5-10 minutes) the node realizes it is not healthy and returns the correct response until it is fully caught up.
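For anyone hitting the same thing, here is a rough sketch of the kind of probe a load balancer could use as a stopgap: it only reports 1 when `getHealth` says "ok" and the node's `getSlot` is also within a threshold of a reference node. This is not our actual health checker. The `MAX_SLOT_LAG` value, the helper names, and the idea of comparing against a separate trusted reference endpoint are all assumptions for illustration.

```python
import json
import urllib.request

# Hypothetical threshold: how many slots behind the reference node we tolerate.
MAX_SLOT_LAG = 150


def rpc_call(url, method, timeout=2.0):
    """Send a parameterless JSON-RPC 2.0 request and return the decoded reply."""
    payload = json.dumps({"jsonrpc": "2.0", "id": 1, "method": method}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def is_healthy(health_reply, node_slot, reference_slot, max_lag=MAX_SLOT_LAG):
    """Trust an "ok" from getHealth only if the slot lag is also small."""
    if health_reply.get("result") != "ok":
        return False
    return reference_slot - node_slot <= max_lag


def probe(node_url, reference_url):
    """Return 1 for healthy, 0 for offline or unhealthy (same convention as the graph)."""
    try:
        health = rpc_call(node_url, "getHealth")
        node_slot = rpc_call(node_url, "getSlot")["result"]
        reference_slot = rpc_call(reference_url, "getSlot")["result"]
    except (OSError, KeyError, ValueError):
        # Network failure, malformed JSON, or an error reply without "result".
        return 0
    return 1 if is_healthy(health, node_slot, reference_slot) else 0
```

With something like this, a node that answers "ok" while 2500 slots behind would still be reported as 0, at the cost of depending on the reference node being caught up itself.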
Version Information:
Startup Arguments:
Proposed Solution
I do not have a solution for this yet. I will try to look into it further if no one else has any ideas.