Performance issue: There is a sudden bump in CPU utilization and a long time to get a response when we send requests to the /search/transactions endpoint #303

Open
tienna opened this issue Mar 3, 2025 · 5 comments

Comments

@tienna

tienna commented Mar 3, 2025

Context & versions

Our client is using Cardano Rosetta Java version 1.1.5 with Docker Compose. Their node has 50 cores and 96 GB of RAM.
Here is a description of the current status:

  • The Cardano node is already synced to the tip.
  • The /network/status API responds immediately.
  • They have about 5000 requests per minute.
  • The /search/transactions API takes 7–8 minutes to respond.
  • The /search/transactions API causes CPU usage to spike to 200%.
  • There is a sudden bump in CPU utilization when requests are sent to the /search/transactions endpoint.

Most API calls return this error:
{"code":5000,"message":"An error occurred","retriable":true,"details":{"message":"An error occurred for request 24b8: Unable to acquire JDBC Connection [HikariPool-1 - Connection is not available, request timed out after 100002ms.] [n/a]"}}

They have set two environment variables (a docker-compose sketch follows the list):

  1. SERVER_TOMCAT_THREADS_MAX=500
  2. SPRING_DATASOURCE_HIKARI_MAXIMUMPOOLSIZE=100
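A minimal docker-compose sketch of how these might be passed to the API container (the service name "rosetta-api" is an assumption, not necessarily what the project's compose file uses):

# Sketch only; adapt to the actual compose file.
services:
  rosetta-api:
    environment:
      SERVER_TOMCAT_THREADS_MAX: "500"                # max Tomcat request-processing threads
      SPRING_DATASOURCE_HIKARI_MAXIMUMPOOLSIZE: "100" # max JDBC connections in the HikariCP pool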

The above API error no longer appears, but /search/transactions still takes ~8 minutes to return a response.

They also tried stopping all incoming connections and then sent a single request to the /search/transactions endpoint, but it still took ~8 minutes to get a response.

@tienna tienna added the bug Something isn't working label Mar 3, 2025
@matiwinnetou
Collaborator

While the application can be scaled, and we will look into the exact performance issue caused by this endpoint, it is important to note that the whole point of Rosetta's stateless architecture is that it is horizontally scalable. With the current architecture it is better to run a few Rosetta instances than one big one. They are deploying it incorrectly, or let's say suboptimally, at the moment.

@matiwinnetou matiwinnetou added Performance Improvement and removed bug Something isn't working labels Mar 3, 2025
@Kartiiyer12
Collaborator

Kartiiyer12 commented Mar 3, 2025

They have 2 instances running behind an nginx, and they send requests to the nginx.

I have asked them to stop all the traffic and send one request, but that has not helped: it still took 8 minutes. I think this is because all the DB sessions were still open and queries from the initial requests were still queued. I have asked them to close all the sessions and test again by sending a single request, which should get a response within a second.
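If it helps to confirm that, the backlog can be inspected on the Postgres side; a sketch (assuming the target database is the one Rosetta's indexer writes to):

-- List non-idle backends, longest-running first, to spot queued or stuck queries.
SELECT pid,
       state,
       now() - query_start AS running_for,
       left(query, 120)    AS query_snippet
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;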

In the logs I could see a warning about a connection leak:

2025-03-03T12:51:06.177Z  WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.ProxyLeakTask     : Connection leak detection triggered for org.postgresql.jdbc.PgConnection@42e58dfa on thread pool-3-thread-1, stack trace follows
 
java.lang.Exception: Apparent connection leak detected
	at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:128) ~[HikariCP-5.0.1.jar!/:na]
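For context, HikariCP only logs "Apparent connection leak detected" when leak detection is enabled; a sketch of the relevant Spring Boot property (the 60000 ms threshold is an example value, not the project's default):

# Log a warning when a connection is held longer than this many milliseconds.
spring.datasource.hikari.leak-detection-threshold=60000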

For their setup, with 5000 requests/min running 2 instances on a 50-core machine, what would be the best way to tune performance?

@matiwinnetou
Collaborator

The best way for now is to add more instances. But as I said, we will look into this, time permitting, and try to reproduce it.

The key to fixing any issue is first to have a reproducible test case; potential products to help with that:

@matiwinnetou
Collaborator

matiwinnetou commented Mar 4, 2025

I think I managed to reproduce this on our test instance while testing some other defect, but I do not know why it happens yet. At the moment, the best way to work around this or similar issues is to run multiple instances of Rosetta and have the exchange's load balancer switch between them round robin. Even if they have 10 Rosetta instances, it is OK to have "two broken ones". Typically load balancers take "busted" instances out of the load balancing until they recover. In our case it may never recover, but I need to investigate more tomorrow without restarting all the Docker services.
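For illustration, a minimal nginx sketch of that kind of setup (instance names, ports and timeouts are assumptions; open-source nginx does passive health checks via max_fails/fail_timeout):

# Round-robin over several Rosetta instances; a server that fails 3 times
# within 30s is taken out of rotation for 30s.
upstream rosetta_api {
    server rosetta-1:8082 max_fails=3 fail_timeout=30s;
    server rosetta-2:8082 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://rosetta_api;
        proxy_read_timeout 120s;   # example value
    }
}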

In my case, however, it wasn't /search/transactions causing issues but, interestingly enough, even /block/transaction. I suspect both issues are caused by the same underlying problem: DB transactions are somehow leaking and not being properly released.

So that needs to be fixed, but in addition we should apply some resilience patterns in this project, e.g. via the resilience4j library, so that we follow the industry standard for this class and quality of API.
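As a rough illustration of that direction (not the project's actual code; the search call, names and limits are placeholders), something along these lines with resilience4j:

// Sketch: cap concurrent /search/transactions work with a Bulkhead and bound
// its latency with a TimeLimiter so slow requests fail fast instead of piling up.
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;

public class SearchTransactionsGuard {

    private final Bulkhead bulkhead = Bulkhead.of("searchTransactions",
            BulkheadConfig.custom()
                    .maxConcurrentCalls(25)                  // placeholder limit
                    .maxWaitDuration(Duration.ofSeconds(2))  // queueing time before rejecting
                    .build());

    private final TimeLimiter timeLimiter = TimeLimiter.of("searchTransactions",
            TimeLimiterConfig.custom()
                    .timeoutDuration(Duration.ofSeconds(30)) // placeholder timeout
                    .build());

    public String search(String query) throws Exception {
        // TimeLimiter bounds how long we wait for the future; Bulkhead rejects
        // calls beyond the concurrency limit instead of exhausting the DB pool.
        Callable<String> timed = TimeLimiter.decorateFutureSupplier(timeLimiter,
                () -> CompletableFuture.supplyAsync(() -> doSearch(query)));
        return Bulkhead.decorateCallable(bulkhead, timed).call();
    }

    private String doSearch(String query) {
        // placeholder for the real repository/DB call
        return "results for " + query;
    }
}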

CC: @Kartiiyer12 @Kammerlo

@matiwinnetou
Collaborator

Update: what I reproduced is something else; basically, the instance I was using didn't have the right DB indices applied.
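For anyone checking the same symptom, the indices actually present can be listed directly in Postgres (a sketch; the table name is a placeholder, not necessarily the project's schema):

-- Show indexes defined on a given table.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'transaction';   -- placeholder table name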
