
Investigate Performance Issues With TRSS / MongoDB #3877

Open
steelhead31 opened this issue Feb 11, 2025 · 10 comments

@steelhead31
Contributor

During the last release cycle, the performance of the TRSS server / MongoDB was quite poor.

This needs to be investigated.

@steelhead31
Contributor Author

Some initial investigation on our TRSS DB server...

Hi Lan,

I'm going to apologize in advance for the length of this message; let me know if you'd prefer it in an email (and an email address!). I've written some scripts and done some digging into the TRSS Mongo DB, and have identified a few potential issues, but having some points of comparison will be very useful.

  1. host_performance.sh (https://raw.githubusercontent.com/steelhead31/infrastructure/refs/heads/trss_perf_fixes/tools/trss_performance_scripts/host_performance.sh)

This script extracts a range of information about a Linux system's hardware configuration; it needs no amendments and can be run as is.

Results should look something like this:

Extracting CPU Performance Information...
CPU Model: AMD EPYC 7452 32-Core Processor
CPU Architecture: x86_64
Total CPU Cores: 2
Physical CPUs: 1
Threads per Core: 2
Max CPU Frequency:
Current CPU Frequency: Unavailable
CPU Load Average: 1 min: 0.93, 5 min: 0.92, 15 min: 0.96
CPU Usage: 54.8%
Extraction complete.

  2. docker_container_perf.sh (https://raw.githubusercontent.com/steelhead31/infrastructure/refs/heads/trss_perf_fixes/tools/trss_performance_scripts/docker_container_perf.sh)
    N.B. Requires jq!
    N.B. Run the script with a single parameter, which is the name of the Docker container.

If you are running your MongoDB inside a Docker container, some performance stats for the container would be useful; this script extracts them.

Results should look something like this:
Extracting MongoDB Container Performance Characteristics...

MongoDB Container Name: aqa-test-tools-mongo-1
MongoDB Container ID: aqa-test-tools-mongo-1
CPU Usage: 102.35%
Memory Usage: 4.425GiB MB / 7.695GiB MB
Network I/O: Received bytes, Sent bytes
Disk I/O: Read 0 Bps, Write 0 Bps
Container Uptime: 2025-01-22 10:17:30
MongoDB Running Processes: 1
Extraction complete.

  3. If using Docker, identify any CPU quotas in place (CpuQuota divided by CpuPeriod gives the effective CPU allowance, e.g. 200000 / 100000 = 2 CPUs):

docker inspect --format='{{.HostConfig.CpuQuota}}, {{.HostConfig.CpuPeriod}}, {{.HostConfig.CpuShares}}' <container_id>
docker stats --no-stream

  4. Extract any slow queries from the Docker container logs (if running inside a container):

docker logs <container id> | grep "Slow query" | grep timeReadingMicros

This may return a lot of data; it does for our instance, which I think is actually the problem being reported. (An alternative to grepping the container logs, using the MongoDB profiler, is sketched at the end of this comment.)

e.g. docker logs 2601dd8ff38c | grep "Slow query" | grep timeReadingMicros

{"t":{"$date":"2025-01-23T15:21:29.747+00:00"},"s":"I", "c":"WRITE", "id":51803, "ctx":"conn28","msg":"Slow query","attr":{"type":"remove","ns":"exampleDb.auditLogs","command":{"q":{"timestamp":{"$lt":{"$date":"2024-12-24T15:21:12.478Z"}}},"limit":0},"planSummary":"COLLSCAN","keysExamined":0,"docsExamined":27814188,"ndeleted":0,"numYields":27814,"queryHash":"97131801","planCacheKey":"97131801","locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":27815}},"FeatureCompatibilityVersion":{"acquireCount":{"w":27815}},"ReplicationStateTransition":{"acquireCount":{"w":27815}},"Global":{"acquireCount":{"w":27815}},"Database":{"acquireCount":{"w":27815}},"Collection":{"acquireCount":{"w":27815}},"Mutex":{"acquireCount":{"r":1}}},"flowControl":{"acquireCount":27815,"timeAcquiringMicros":9973},"storage":{"data":{"bytesRead":5983072617,"timeReadingMicros":207010}},"remote":"172.24.0.5:58150","durationMillis":17268}}

  5. Some DB-related stats gathering:

Log into mongosh, switch to the TRSS DB (ours is called exampleDb), and check what indexes are on the testResults collection:

mongosh
db.currentOp()
use exampleDb
exampleDb> db.testResults.getIndexes()
[
  { v: 2, key: { _id: 1 }, name: '_id_' },
  { v: 2, key: { parentId: 1 }, name: 'parentId_1' },
  {
    v: 2,
    key: { url: 1, buildName: 1, buildNum: 1 },
    name: 'url_1_buildName_1_buildNum_1'
  },
  { v: 2, key: { 'tests._id': 1 }, name: 'tests._id_1' }
]

Now get an execution plan for this query; you may need to update the url and buildName values appropriately for your instance:

db.testResults.find({
url: "https://ci.adoptopenjdk.net",
buildName: "AQA_Test_Pipeline"
}).sort({ buildNum: 1 }).explain("executionStats")

This should give me some useful points of comparison, in particular whether the plan uses an IXSCAN on url_1_buildName_1_buildNum_1 or falls back to a COLLSCAN, and how totalDocsExamined compares with nReturned.
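
As an alternative to grepping the container logs, MongoDB's built-in profiler can capture slow operations directly; a rough sketch in mongosh (the 100 ms threshold is just an example, and profiling adds a little overhead of its own):

use exampleDb
// profiling level 1 records only operations slower than slowms
db.setProfilingLevel(1, { slowms: 100 })
// review captured slow operations, most recent first
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()
// switch profiling back off when finished
db.setProfilingLevel(0)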

steelhead31 moved this from Todo to In Progress in 2025 Adoptium Plan on Feb 12, 2025
@llxia

llxia commented Feb 12, 2025

Could we start by identifying the symptoms of the TRSS performance issues so we can investigate them effectively? It is hard to determine the root cause based solely on "quite poor" performance. After pinpointing the scenario, we can collect performance metrics with a specific target in mind.

@steelhead31
Contributor Author

@llxia yes, this is what I'm trying to do. I believe the poor performance occurs when data is being inserted into TRSS, and I think I've identified a couple of possible culprits, hence querying the DB for slow queries and checking the index states. The above commands would allow me to compare machine specifications (I believe our MongoDB is being limited by the underlying server and Docker container configuration, which is causing inserts/updates etc. to take significantly longer than they should). The indexing on the database could also be improved, but I'd like to compare your indexing with ours to see whether you have the same issues (although less pronounced).

@smlambert is there an "easy" way to replicate this issue on our TRSS? I could definitely do some more interactive performance monitoring whilst the issue is occurring, rather than relying on the container/MongoDB logs.
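
For reference, the kind of interactive check I have in mind would be something along these lines in mongosh (the 5-second threshold is just an example):

// operations that have already been running for 5+ seconds
db.currentOp({ active: true, secs_running: { $gte: 5 } })
// memory usage and WiredTiger cache pressure at a glance
db.serverStatus().mem
db.serverStatus().wiredTiger.cache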

@smlambert
Contributor

is there an "easy" way to replicate this issue on our trss, I could definitely do some more interactive performance monitoring whilst the issue is occurring, rather than relying on the container/mongodb logs

Anecdotally, it takes up to several days to see a full (completed) pipeline in a Grid view in TRSS, which is not good enough for use during releases. This was not the case previously, but several changes have occurred: (1) the service now runs in containers, and (2) the Azure machine that this server runs on has changed (with reduced capacity).

@steelhead31 - depending on how involved you want to be, there are 2 options:
a) run a local instance of TRSS for debugging (this is the preferred approach and gives the most flexibility; after installing the prereqs, run via the Docker Compose instructions here)
b) use a DB client tool to connect to the live instance

@steelhead31
Contributor Author

Thanks, I'll try those. I do believe the reduction to a 2-CPU machine is a large part of the issue, as the MongoDB Docker container is using most of the machine's resources even when idle. There is sufficient memory, and the disk is fast, but analysing the slow-query logs suggests CPU is the likely culprit. Some additional indexing (particularly on the auditLogs and testResults collections) may also be beneficial.

@steelhead31
Contributor Author

Ok, so after some extensive performance analysis...

  1. Lack Of Indexing

Most of the SLOW_QUERY log entries are related to the DB performing a COLLSCAN instead of using indexes. This is a major performance issue, especially when dealing with large collections; the 2 biggest culprits are auditLogs and testResults. (A sketch of the indexes that might help is at the end of this comment.)

The query logs show that each operation scans millions of documents without using indexes...

"planSummary":"COLLSCAN","docsExamined":20818444

This means MongoDB is scanning the entire collection instead of using an efficient index.

  2. Large Data Reads & Locking

Queries are reading gigabytes of data (bytesRead: 5,350,000,000+), causing high I/O.
The logs indicate frequent locking on collections, databases, and replication state transitions.

  3. Deleting Old Data in auditLogs

Many slow queries are DELETE operations on exampleDb.auditLogs, filtering by timestamp:

"delete":"auditLogs","deletes":[{"q":{"timestamp":{"$lt":...}}}]

This means that MongoDB is scanning all documents to find ones that match.

  4. Fetching Non-Done Test Results

The query filtering for status: {$ne: "Done"} also uses a collection scan:

"find":"testResults","filter":{"status":{"$ne":"Done"}}

@sxa
Member

sxa commented Feb 14, 2025

I do believe the reduction to a 2 CPU machine is a large part of the issue

@steelhead31 What is that based on? From what I could see it was only ever chewing up 100% of one CPU, which would not indicate an issue with core count specifically.

@steelhead31
Contributor Author

I've run several top commands in the container, averaging 135% of the CPU...

@sxa
Member

sxa commented Feb 14, 2025

For the record, it's showing this right now, which would typically indicate one full core in use.

[Screenshot of current CPU usage]

Even if that hit 135% as opposed to 102% (was that purely on the mongod process too?), that would not indicate that it's exceeding 2 cores' worth of CPU.

The amount of swap it's using may be a bit of a concern though.

@steelhead31
Contributor Author

I'm inclined to try adding some indexing before anything else, and investigating what's in the auditLogs collection, as that's by far the biggest volume of data it attempts to read for each operation.
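
To size up what is actually in auditLogs, something along these lines in mongosh should give a quick picture (exampleDb assumed, as above):

use exampleDb
// rough document count (from collection metadata) and storage stats
db.auditLogs.estimatedDocumentCount()
db.auditLogs.stats()
// sample document, to see what actually gets written to the audit log
db.auditLogs.findOne()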
