
Investigate Performance Issues With TRSS / MongoDB #3877

Open
steelhead31 opened this issue Feb 11, 2025 · 10 comments

@steelhead31
Contributor

During the last release cycle, the performance of the TRSS server / MongoDB was quite poor.

This needs to be investigated.

@steelhead31
Contributor Author

Some initial investigation on our TRSS DB server...

Hi Lan,

I'm going to apologize in advance for the length of this message; let me know if you'd prefer it in an email (and an email address!). I've written some scripts and done some digging into the TRSS Mongo DB, and have identified a few potential issues, but having some points of comparison will be very useful.

  1. host_performance.sh (https://raw.githubusercontent.com/steelhead31/infrastructure/refs/heads/trss_perf_fixes/tools/trss_performance_scripts/host_performance.sh)

This script extracts a range of information about a Linux system's hardware configuration; it needs no amendments and can be run as is.

Results should look something like this:

Extracting CPU Performance Information...
CPU Model: AMD EPYC 7452 32-Core Processor
CPU Architecture: x86_64
Total CPU Cores: 2
Physical CPUs: 1
Threads per Core: 2
Max CPU Frequency:
Current CPU Frequency: Unavailable
CPU Load Average: 1 min: 0.93, 5 min: 0.92, 15 min: 0.96
CPU Usage: 54.8%
Extraction complete.

  2. docker_container_perf.sh (https://raw.githubusercontent.com/steelhead31/infrastructure/refs/heads/trss_perf_fixes/tools/trss_performance_scripts/docker_container_perf.sh)
    N.B. Requires jq!
    N.B. Run the script with a single parameter, which is the name of the Docker container.

If you are running your MongoDB inside a Docker container, some performance stats for the container would be useful; this script extracts them.

Results should look something like this:
Extracting MongoDB Container Performance Characteristics...

MongoDB Container Name: aqa-test-tools-mongo-1
MongoDB Container ID: aqa-test-tools-mongo-1
CPU Usage: 102.35%
Memory Usage: 4.425GiB MB / 7.695GiB MB
Network I/O: Received bytes, Sent bytes
Disk I/O: Read 0 Bps, Write 0 Bps
Container Uptime: 2025-01-22 10:17:30
MongoDB Running Processes: 1
Extraction complete.

  3. If using Docker, identify any CPU quotas in place (CpuQuota divided by CpuPeriod gives the effective CPU allowance, e.g. 200000 / 100000 = 2 CPUs):

docker inspect --format='{{.HostConfig.CpuQuota}}, {{.HostConfig.CpuPeriod}}, {{.HostConfig.CpuShares}}' <container_id>
docker stats --no-stream

  4. Extract any slow queries from the Docker container logs (if running inside a container):

docker logs <container id> | grep "Slow query" | grep timeReadingMicros

This may return a lot of data; it does for our instance, which I think is actually the problem being reported. (An alternative to grepping the container logs, using the MongoDB profiler, is sketched at the end of this comment.)

e.g. docker logs 2601dd8ff38c | grep "Slow query" | grep timeReadingMicros

{"t":{"$date":"2025-01-23T15:21:29.747+00:00"},"s":"I", "c":"WRITE", "id":51803, "ctx":"conn28","msg":"Slow query","attr":{"type":"remove","ns":"exampleDb.auditLogs","command":{"q":{"timestamp":{"$lt":{"$date":"2024-12-24T15:21:12.478Z"}}},"limit":0},"planSummary":"COLLSCAN","keysExamined":0,"docsExamined":27814188,"ndeleted":0,"numYields":27814,"queryHash":"97131801","planCacheKey":"97131801","locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":27815}},"FeatureCompatibilityVersion":{"acquireCount":{"w":27815}},"ReplicationStateTransition":{"acquireCount":{"w":27815}},"Global":{"acquireCount":{"w":27815}},"Database":{"acquireCount":{"w":27815}},"Collection":{"acquireCount":{"w":27815}},"Mutex":{"acquireCount":{"r":1}}},"flowControl":{"acquireCount":27815,"timeAcquiringMicros":9973},"storage":{"data":{"bytesRead":5983072617,"timeReadingMicros":207010}},"remote":"172.24.0.5:58150","durationMillis":17268}}

  5. Some DB-related stats gathering:

Log into mongosh, switch to the TRSS DB (ours is called exampleDb), and check what indexes are on the testResults collection:

mongosh
db.currentOp()
use exampleDb
exampleDb> db.testResults.getIndexes()
[
  { v: 2, key: { _id: 1 }, name: '_id_' },
  { v: 2, key: { parentId: 1 }, name: 'parentId_1' },
  {
    v: 2,
    key: { url: 1, buildName: 1, buildNum: 1 },
    name: 'url_1_buildName_1_buildNum_1'
  },
  { v: 2, key: { 'tests._id': 1 }, name: 'tests._id_1' }
]

Now get an execution plan for this query; you may need to update the url and buildName values appropriately for your instance:

db.testResults.find({
url: "https://ci.adoptopenjdk.net",
buildName: "AQA_Test_Pipeline"
}).sort({ buildNum: 1 }).explain("executionStats")

This should give me some useful points of comparison, in particular whether the plan uses an IXSCAN on url_1_buildName_1_buildNum_1 or falls back to a COLLSCAN, and how totalDocsExamined compares with nReturned.
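
As an alternative to grepping the container logs, MongoDB's built-in profiler can capture slow operations directly; a rough sketch in mongosh (the 100 ms threshold is just an example, and profiling adds a little overhead of its own):

use exampleDb
// profiling level 1 records only operations slower than slowms
db.setProfilingLevel(1, { slowms: 100 })
// review captured slow operations, most recent first
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()
// switch profiling back off when finished
db.setProfilingLevel(0)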

steelhead31 moved this from Todo to In Progress in 2025 Adoptium Plan on Feb 12, 2025
@llxia

llxia commented Feb 12, 2025

Could we start by identifying the symptoms of the TRSS performance issues so we can investigate them effectively? It is hard to determine the root cause based solely on "quite poor" performance. After pinpointing the scenario, we can collect performance metrics with a specific target in mind.

@steelhead31
Contributor Author

@llxia yes, this is what I'm trying to do. I believe the poor performance occurs when data is being inserted into TRSS, and I think I've identified a couple of possible culprits, hence querying the DB for slow queries and checking the index states. The above commands would allow me to compare machine specifications (I believe our MongoDB is being limited by the underlying server and Docker container configuration, which is causing inserts/updates etc. to take significantly longer than they should). The indexing on the database could also be improved, but I'd like to compare your indexing with ours to see whether you have the same issues (although less pronounced).

@smlambert is there an "easy" way to replicate this issue on our TRSS? I could definitely do some more interactive performance monitoring whilst the issue is occurring, rather than relying on the container/MongoDB logs.
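
For reference, the kind of interactive check I have in mind would be something along these lines in mongosh (the 5-second threshold is just an example):

// operations that have already been running for 5+ seconds
db.currentOp({ active: true, secs_running: { $gte: 5 } })
// memory usage and WiredTiger cache pressure at a glance
db.serverStatus().mem
db.serverStatus().wiredTiger.cache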

@smlambert
Contributor

is there an "easy" way to replicate this issue on our trss, I could definitely do some more interactive performance monitoring whilst the issue is occurring, rather than relying on the container/mongodb logs

Anecdotally, it takes up to several days to see a full (completed) pipeline in a Grid view in TRSS, which is not good enough for use during releases. This was not the case previously, but several changes have occurred: (1) the service now runs in containers, and (2) the Azure machine that this server runs on has changed (with reduced capacity).

@steelhead31 - depending on how involved you want to be, there are 2 options:
a) run a local instance of TRSS for debugging (this is the preferred approach and gives the most flexibility; after installing the prereqs, run via the Docker Compose instructions here)
b) use a DB client tool to connect to the live instance

@steelhead31
Contributor Author

Thanks, I'll try those. I do believe the reduction to a 2-CPU machine is a large part of the issue, as the MongoDB Docker container is using most of the machine's resources even when idle. There is sufficient memory, and the disk is fast, but analysing the slow-query logs suggests CPU is the likely culprit. Some additional indexing (particularly on the auditLogs and testResults collections) may also be beneficial.

@steelhead31
Contributor Author

Ok, so after some extensive performance analysis...

  1. Lack Of Indexing

Most of the SLOW_QUERY log entries are related to the DB performing a COLLSCAN instead of using indexes. This is a major performance issue, especially when dealing with large collections; the 2 biggest culprits are auditLogs and testResults. (A sketch of the indexes that might help is at the end of this comment.)

The query logs show that each operation scans millions of documents without using indexes...

"planSummary":"COLLSCAN","docsExamined":20818444

This means MongoDB is scanning the entire collection instead of using an efficient index.

  2. Large Data Reads & Locking

Queries are reading gigabytes of data (bytesRead: 5,350,000,000+), causing high I/O.
The logs indicate frequent locking on collections, databases, and replication state transitions.

  3. Deleting Old Data in auditLogs

Many slow queries are DELETE operations on exampleDb.auditLogs, filtering by timestamp:

"delete":"auditLogs","deletes":[{"q":{"timestamp":{"$lt":...}}}]

This means that MongoDB is scanning all documents to find ones that match.

  4. Fetching Non-Done Test Results

The query filtering for status: {$ne: "Done"} also uses a collection scan:

"find":"testResults","filter":{"status":{"$ne":"Done"}}

@sxa
Member

sxa commented Feb 14, 2025

I do believe the reduction to a 2 CPU machine is a large part of the issue

@steelhead31 What is that based on? From what I could see it was only ever chewing up 100% of one CPU, which would not indicate an issue with core count specifically.

@steelhead31
Contributor Author

I've run several top commands in the container, averaging 135% of the CPU...

@sxa
Member

sxa commented Feb 14, 2025

For the record, it's showing this right now, which would typically indicate one full core in use.

[Screenshot of current CPU usage]

Even if that hit 135% as opposed to 102% (was that purely on the mongod process too?), that would not indicate that it's exceeding 2 cores' worth of CPU.

The amount of swap it's using may be a bit of a concern though.

@steelhead31
Contributor Author

I'm inclined to try adding some indexing before anything else, and investigating what's in the auditLogs collection, as that's by far the biggest volume of data it attempts to read for each operation.
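
To size up what is actually in auditLogs, something along these lines in mongosh should give a quick picture (exampleDb assumed, as above):

use exampleDb
// rough document count (from collection metadata) and storage stats
db.auditLogs.estimatedDocumentCount()
db.auditLogs.stats()
// sample document, to see what actually gets written to the audit log
db.auditLogs.findOne()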
