Performance issue: sudden bump in CPU utilization and long response times for requests to the /search/transactions endpoint #303
While the application can be scaled, and we will look into the exact performance issue caused by this endpoint, it is important to note that the whole point of Rosetta's stateless architecture is that it is horizontally scalable. In the current architecture it is better to have a few Rosetta instances than one big one. They are deploying it incorrectly at the moment, or let's say suboptimally.
They have 2 instances running behind nginx, and they send requests to the nginx. I asked them to stop all the traffic and send one request, but that has not helped: it still took 8 minutes. I think this is because all the DB sessions were still open and queries to the DB were still queued from the initial requests. I have asked them to stop all the sessions and test again with a single request, which should be answered within a second. In the logs I could see warnings about a connection leak.
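If connections are indeed leaking, HikariCP can report the offending call sites itself. A hedged sketch, assuming the deployment reads standard Spring Boot datasource properties (the exact property names this project exposes may differ):

```properties
# Log a stack trace for any connection held longer than 60 s (possible leak).
spring.datasource.hikari.leak-detection-threshold=60000
# Fail fast instead of letting callers queue for ~100 s as in the error report.
spring.datasource.hikari.connection-timeout=30000
```

With leak detection enabled, HikariCP logs a warning with the stack trace of the code path that borrowed a connection and never returned it, which should help pinpoint the leaking query.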
For them, with 5000 requests/min running on 2 instances on a 50-core machine, what would be the best way of tuning performance?
The best way for now is to add more instances. But as I said, we will look into this, time permitting, and try to reproduce it. The key to fixing any issue is first to have a reproducible test case; potential products to help with that:
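On pool sizing for the 50-core question above: HikariCP's published rule of thumb is pool size ≈ (core count × 2) + effective spindle count, i.e. a fairly small pool even on big hardware, rather than one sized to the request rate. A minimal sketch of that arithmetic (class and method names here are illustrative, not part of any library):

```java
public class PoolSizing {
    // HikariCP's rule of thumb: pool size = (cores * 2) + effective spindles.
    // Larger pools usually just move the queueing into the database.
    static int suggestedPoolSize(int coreCount, int effectiveSpindles) {
        return coreCount * 2 + effectiveSpindles;
    }

    public static void main(String[] args) {
        // 50-core machine, assuming SSD storage (~1 effective spindle).
        System.out.println(suggestedPoolSize(50, 1)); // prints 101
    }
}
```

Note the suggestion is per instance; with 2 instances behind nginx the database sees the sum of both pools, which is another reason scaling out beats one oversized pool.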
I think I managed to reproduce this on our test instance while testing some other defect, but I do not know yet why it happens. At the moment, the best way to work around this and similar issues is to run multiple instances of Rosetta, with the exchange's load balancer using round robin to switch between them. Even if they have 10 Rosetta instances, it is acceptable to have "two broken ones": load balancers typically take "busted" instances out of the rotation until they recover. In our case an instance may never recover, but I need to investigate more tomorrow without restarting all the Docker services. Interestingly, in my case it wasn't /search/transactions causing issues but /block/transaction. I suspect both issues are caused by the same underlying problem: DB transactions are somehow leaking and not being properly released. That needs to be fixed, but in addition we should apply some reactive patterns in this project, e.g. via the resilience4j library, so that we meet the industry standard for this class and quality of API.
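The reactive-patterns point can be illustrated with a minimal circuit breaker in plain Java. This is a hand-rolled stand-in for what resilience4j provides out of the box (its actual API differs); all names and thresholds below are illustrative:

```java
import java.util.function.Supplier;

// Minimal circuit-breaker sketch: after N consecutive failures the breaker
// "opens" and fails fast for a cool-down period instead of queueing callers.
public class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private State state = State.CLOSED;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized <T> T call(Supplier<T> action) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openMillis) {
                throw new IllegalStateException("circuit open: failing fast");
            }
            state = State.CLOSED; // half-open: allow a trial call through
            consecutiveFailures = 0;
        }
        try {
            T result = action.get();
            consecutiveFailures = 0; // success resets the failure counter
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = System.currentTimeMillis();
            }
            throw e;
        }
    }
}
```

Wrapped around the DB-backed request handler, such a breaker makes an unhealthy instance fail fast instead of queueing requests for minutes, which also lets a round-robin load balancer detect the instance as unhealthy and route around it.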
Update: what I reproduced is something else; the instance I was using simply didn't have the right DB indices applied.
Context & versions
Our client is using Cardano Rosetta Java version 1.1.5 with Docker Compose. Their node has 50 cores and 96 GB RAM.
Current status:
Most APIs are returning this error:
```json
{"code":5000,"message":"An error occurred","retriable":true,"details":{"message":"An error occurred for request 24b8: Unable to acquire JDBC Connection [HikariPool-1 - Connection is not available, request timed out after 100002ms.] [n/a]"}}
```
They have set two environment variables:
The above API error no longer appears, but /search/transactions still takes ~8 minutes to return a response.
They also tried stopping all incoming connections and then sending a single request to the /search/transactions endpoint, but it still took ~8 minutes to get a response.