Performance issue: There is a sudden bump in CPU utilization and a long time to get a response when we send requests to the /search/transactions endpoint #303

Open
tienna opened this issue Mar 3, 2025 · 5 comments

Comments

@tienna

tienna commented Mar 3, 2025

Context & versions

Our client is using Cardano Rosetta Java version 1.1.5 with Docker Compose. Their node has 50 cores and 96 GB of RAM.
Here is a description of the current status:

  • The Cardano node is already synced to the tip.
  • The /network/status API responds immediately.
  • They have about 5000 requests per minute.
  • The /search/transactions API takes 7–8 minutes to respond.
  • The /search/transactions API causes CPU usage to spike to 200%.
  • There is a sudden bump in CPU utilization when requests are sent to the /search/transactions endpoint.

Most API calls return this error:
{"code":5000,"message":"An error occurred","retriable":true,"details":{"message":"An error occurred for request 24b8: Unable to acquire JDBC Connection [HikariPool-1 - Connection is not available, request timed out after 100002ms.] [n/a]"}}

They have set two environment variables (a docker-compose sketch follows the list):

  1. SERVER_TOMCAT_THREADS_MAX=500
  2. SPRING_DATASOURCE_HIKARI_MAXIMUMPOOLSIZE=100
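A minimal docker-compose sketch of how these might be passed to the API container (the service name "rosetta-api" is an assumption, not necessarily what the project's compose file uses):

# Sketch only; adapt to the actual compose file.
services:
  rosetta-api:
    environment:
      SERVER_TOMCAT_THREADS_MAX: "500"                # max Tomcat request-processing threads
      SPRING_DATASOURCE_HIKARI_MAXIMUMPOOLSIZE: "100" # max JDBC connections in the HikariCP pool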

The above API error no longer appears, but /search/transactions still takes ~8 minutes to return a response.

They also tried stopping all incoming connections and then sent a single request to the /search/transactions endpoint, but it still took ~8 minutes to get a response.

@tienna tienna added the bug Something isn't working label Mar 3, 2025
@matiwinnetou
Collaborator

While the application can be scaled, and we will look into the exact performance issue caused by this endpoint, it is important to note that the whole point of Rosetta's stateless architecture is that it is horizontally scalable. With the current architecture it is better to run a few Rosetta instances than one big one. They are deploying it incorrectly, or let's say suboptimally, at the moment.

@matiwinnetou matiwinnetou added Performance Improvement and removed bug Something isn't working labels Mar 3, 2025
@Kartiiyer12
Collaborator

Kartiiyer12 commented Mar 3, 2025

They have 2 instances running behind an nginx, and they send requests to the nginx.

I have asked them to stop all the traffic and send one request, but that has not helped: it still took 8 minutes. I think this is because all the DB sessions were still open and queries from the initial requests were still queued. I have asked them to close all the sessions and test again by sending a single request, which should get a response within a second.
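If it helps to confirm that, the backlog can be inspected on the Postgres side; a sketch (assuming the target database is the one Rosetta's indexer writes to):

-- List non-idle backends, longest-running first, to spot queued or stuck queries.
SELECT pid,
       state,
       now() - query_start AS running_for,
       left(query, 120)    AS query_snippet
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;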

In the logs I could see a warning about a connection leak:

2025-03-03T12:51:06.177Z  WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.ProxyLeakTask     : Connection leak detection triggered for org.postgresql.jdbc.PgConnection@42e58dfa on thread pool-3-thread-1, stack trace follows
 
java.lang.Exception: Apparent connection leak detected
	at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:128) ~[HikariCP-5.0.1.jar!/:na]
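For context, HikariCP only logs "Apparent connection leak detected" when leak detection is enabled; a sketch of the relevant Spring Boot property (the 60000 ms threshold is an example value, not the project's default):

# Log a warning when a connection is held longer than this many milliseconds.
spring.datasource.hikari.leak-detection-threshold=60000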

For their setup, with 5000 requests/min running 2 instances on a 50-core machine, what would be the best way to tune performance?

@matiwinnetou
Collaborator

The best way for now is to add more instances. But as I said, we will look into this, time permitting, and try to reproduce it.

The key to fixing any issue is first to have a reproducible test case; potential products to help with that:

@matiwinnetou
Collaborator

matiwinnetou commented Mar 4, 2025

I think I managed to reproduce this on our test instance while testing some other defect, but I do not know why it happens yet. At the moment, the best way to work around this or similar issues is to run multiple instances of Rosetta and have the exchange's load balancer switch between them round robin. Even if they have 10 Rosetta instances, it is OK to have "two broken ones". Typically load balancers take "busted" instances out of the load balancing until they recover. In our case it may never recover, but I need to investigate more tomorrow without restarting all the Docker services.
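For illustration, a minimal nginx sketch of that kind of setup (instance names, ports and timeouts are assumptions; open-source nginx does passive health checks via max_fails/fail_timeout):

# Round-robin over several Rosetta instances; a server that fails 3 times
# within 30s is taken out of rotation for 30s.
upstream rosetta_api {
    server rosetta-1:8082 max_fails=3 fail_timeout=30s;
    server rosetta-2:8082 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://rosetta_api;
        proxy_read_timeout 120s;   # example value
    }
}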

In my case, however, it wasn't /search/transactions causing issues but, interestingly enough, even /block/transaction. I suspect both issues are caused by the same underlying problem: DB transactions are somehow leaking and not being properly released.

So that needs to be fixed, but in addition we should apply some resilience patterns in this project, e.g. via the resilience4j library, so that we follow the industry standard for this class and quality of API.
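As a rough illustration of that direction (not the project's actual code; the search call, names and limits are placeholders), something along these lines with resilience4j:

// Sketch: cap concurrent /search/transactions work with a Bulkhead and bound
// its latency with a TimeLimiter so slow requests fail fast instead of piling up.
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;

public class SearchTransactionsGuard {

    private final Bulkhead bulkhead = Bulkhead.of("searchTransactions",
            BulkheadConfig.custom()
                    .maxConcurrentCalls(25)                  // placeholder limit
                    .maxWaitDuration(Duration.ofSeconds(2))  // queueing time before rejecting
                    .build());

    private final TimeLimiter timeLimiter = TimeLimiter.of("searchTransactions",
            TimeLimiterConfig.custom()
                    .timeoutDuration(Duration.ofSeconds(30)) // placeholder timeout
                    .build());

    public String search(String query) throws Exception {
        // TimeLimiter bounds how long we wait for the future; Bulkhead rejects
        // calls beyond the concurrency limit instead of exhausting the DB pool.
        Callable<String> timed = TimeLimiter.decorateFutureSupplier(timeLimiter,
                () -> CompletableFuture.supplyAsync(() -> doSearch(query)));
        return Bulkhead.decorateCallable(bulkhead, timed).call();
    }

    private String doSearch(String query) {
        // placeholder for the real repository/DB call
        return "results for " + query;
    }
}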

CC: @Kartiiyer12 @Kammerlo

@matiwinnetou
Collaborator

Update: what I reproduced is something else; basically, the instance I was using didn't have the right DB indices applied.
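For anyone checking the same symptom, the indices actually present can be listed directly in Postgres (a sketch; the table name is a placeholder, not necessarily the project's schema):

-- Show indexes defined on a given table.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'transaction';   -- placeholder table name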
