
Commit 68f93f4

Merge pull request #1755 from galderz/topic.2107.blog-rss-debugging
Blog post: Analysing Quarkus Native startup RSS
2 parents d0596c9 + 381559c commit 68f93f4

7 files changed (+4858 -1 lines)

Gemfile.lock (+1 -1)

@@ -49,7 +49,7 @@ GEM
       rexml
     kramdown-parser-gfm (1.1.0)
       kramdown (~> 2.0)
-    liquid (4.0.3)
+    liquid (4.0.4)
     listen (3.5.1)
       rb-fsevent (~> 0.10, >= 0.10.3)
       rb-inotify (~> 0.9, >= 0.9.10)
@@ -0,0 +1,223 @@

---
layout: post
title: 'Analysing Quarkus Native startup RSS consumption'
date: 2023-07-27
tags: native rss jfr
synopsis: 'Use Linux native tooling and JFR to understand Quarkus Native startup RSS consumption'
author: galderz
---
:imagesdir: /assets/images/posts/native-startup-rss-troubleshooting

During the development of Quarkus 2.13,
we discovered that there was a startup degradation in native mode.
One of the key aspects of this degradation was that RSS consumption on startup had gone up by about 10-15% compared to Quarkus 2.7.
In this blog post you will learn how the Quarkus native RSS consumption debugging techniques described in
https://quarkus.io/guides/native-reference#rss[the Quarkus native reference guide]
were used to diagnose this issue.
You will also learn about recent updates in GraalVM that make this kind of analysis more convenient.

Our analysis showed that the RSS consumption for Quarkus 2.7 was about 33MB, while for Quarkus 2.13 it was about 36MB, roughly a 10% increase.
Below we see how we measured the RSS consumption for Quarkus 2.7.6.Final:

[source,bash]
----
$ getting-started/target/getting-started-1.0.0-SNAPSHOT-runner -Xmx64m
...
2023-07-21 10:13:11,304 INFO [io.quarkus] (main) getting-started 1.0.0-SNAPSHOT native (powered by Quarkus 2.7.6.Final) started in 0.023s. Listening on: http://0.0.0.0:8080

$ ps -p $(pidof getting-started-1.0.0-SNAPSHOT-runner) -o rss=
33024
----
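
If you prefer sampling RSS over time rather than taking a single reading, the value can also be read straight from `/proc`. Below is a small sketch using standard Linux tooling; it assumes the same executable name as in the example above:

[source,bash]
----
# Sample the resident set size (VmRSS, reported in kB) of the running
# native executable once per second until the process exits.
PID=$(pidof getting-started-1.0.0-SNAPSHOT-runner)
while kill -0 "$PID" 2>/dev/null; do
    grep VmRSS "/proc/$PID/status"
    sleep 1
done
----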

Similarly, below we see how we measured the RSS consumption for Quarkus 2.13.4.Final,
which in our environment had gone up to about 36MB:

[source,bash]
----
$ getting-started/target/getting-started-1.0.0-SNAPSHOT-runner -Xmx64m
...
2023-07-21 10:24:38,529 INFO [io.quarkus] (main) getting-started 1.0.0-SNAPSHOT native (powered by Quarkus 2.13.4.Final) started in 0.027s. Listening on: http://0.0.0.0:8080

$ ps -p $(pidof getting-started-1.0.0-SNAPSHOT-runner) -o rss=
36608
----

Since Quarkus 2.7 and 2.13 used different GraalVM versions (21.3 vs 22.3),
we first investigated whether the increase in the startup time and RSS consumption was due to a change in GraalVM.
This turned out not to be the case,
as using the same GraalVM version on both Quarkus versions still resulted in similar differences.
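
One way to rule out the GraalVM version is to pin the very same builder image for both Quarkus versions and rebuild. The sketch below uses Quarkus' container build support; the Mandrel image tag shown is illustrative rather than the exact one we used:

[source,bash]
----
# Build with an explicitly pinned GraalVM/Mandrel builder image so that any
# difference in startup RSS cannot be attributed to the compiler version.
$ ./mvnw package -DskipTests -Dnative \
    -Dquarkus.native.container-build=true \
    -Dquarkus.native.builder-image=quay.io/quarkus/ubi-quarkus-mandrel-builder-image:22.3-java17
----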

We also profiled startup in JVM mode,
and although we could observe a slight increase in heap usage,
the RSS increase was not apparent in JVM mode.
So, the challenge was to understand what was causing an RSS increase that seemed to only affect native mode
and that was Quarkus specific.
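
For completeness, the JVM-mode comparison can be done with standard JDK tooling. The sketch below assumes the default Quarkus fast-jar layout produced by `./mvnw package`:

[source,bash]
----
# Run the application in JVM mode with the same heap cap.
$ java -Xmx64m -jar getting-started/target/quarkus-app/quarkus-run.jar
...

# In another terminal: resident set size of the JVM process (in kB).
$ ps -p $(pgrep -f quarkus-run.jar) -o rss=

# Heap occupancy as reported by the JVM itself.
$ jcmd $(pgrep -f quarkus-run.jar) GC.heap_info
----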

In native mode we needed to look for system calls that increased RSS,
and investigate what caused them.
Memory can be allocated through several mechanisms,
e.g. library calls such as `malloc` or direct `mmap` system calls,
but when we analysed the system calls emitted on startup,
we discovered that `mmap` calls were the most prominent ones.
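
A quick way to obtain such a per-syscall breakdown is `strace` in summary mode, restricted to memory-related system calls. This is a sketch; note that tracing noticeably slows startup down:

[source,bash]
----
# Count memory-related system calls (mmap, brk, munmap, ...) issued during
# startup, across all threads, and print a summary table on exit.
$ strace -f -c -e trace=%memory -- \
    getting-started/target/getting-started-1.0.0-SNAPSHOT-runner -Xmx64m
----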

Following
https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html[Brendan Gregg’s Memory Leak (and Growth) Flame Graphs guide]
on a Linux environment,
we were able to produce flamegraphs for the `mmap` system calls.
To obtain these flamegraphs, we captured the `sys_enter_mmap` system call events with `perf record`,
and then generated flamegraphs from the recorded data:

[source,bash]
----
$ sudo perf record -e syscalls:sys_enter_mmap --call-graph dwarf \
    -- getting-started/target/getting-started-1.0.0-SNAPSHOT-runner -Xmx64m
...
2023-07-21 10:13:11,304 INFO [io.quarkus] (main) getting-started 1.0.0-SNAPSHOT native (powered by Quarkus 2.7.6.Final) started in 0.023s. Listening on: http://0.0.0.0:8080

$ perf script -i perf.data > out.stacks

$ /opt/FlameGraph/stackcollapse-perf.pl < out.stacks \
    | /opt/FlameGraph/flamegraph.pl \
        --color=mem \
        --title="mmap Flame Graph" \
        --countname="calls" > out.svg
----

[NOTE]
====
The native executable above was created passing in
`-Dquarkus.native.debug.enabled` and `-Dquarkus.native.additional-build-args=-H:-DeleteLocalSymbols`
as extra build flags.
====
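
For reference, the full build invocation would look roughly like this; it is a sketch that assumes the getting-started project used throughout this post:

[source,bash]
----
# Build the native executable with debug info and without deleting local
# symbols, so that perf can resolve the SubstrateVM allocation frames.
$ ./mvnw package -DskipTests -Dnative \
    -Dquarkus.native.debug.enabled \
    -Dquarkus.native.additional-build-args=-H:-DeleteLocalSymbols
----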

`mmap` calls happen for a variety of reasons,
but in the context of this blog post,
we're interested in those `mmap` calls that are triggered by object allocations.
Not every object allocation triggers an `mmap` call.
Instead, each thread in SubstrateVM will allocate a heap chunk using `mmap` when the first object is allocated,
and it will use this heap chunk as long as there's space for further objects allocated in the same thread.
When the heap chunk is full,
the thread will request a new heap chunk using `mmap`.
This pattern is called thread-local allocation.
HotSpot has a similar feature as well,
but there the heap chunk size is dynamically computed depending on multiple factors,
while on SubstrateVM the size is fixed.
At the time of writing, the default heap chunk size is 1MB,
but an upcoming change will make it 512KB instead.
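
If you want to experiment with the effect of the chunk size yourself, SubstrateVM exposes it as a build-time option. Treat the option name below as an assumption and verify it against your GraalVM version (for example via `native-image --expert-options-all`):

[source,bash]
----
# Rebuild with a smaller aligned heap chunk size (512KB). The option name
# -H:AlignedHeapChunkSize is assumed here and may differ between GraalVM versions.
$ ./mvnw package -DskipTests -Dnative \
    -Dquarkus.native.additional-build-args=-H:AlignedHeapChunkSize=524288
----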

We describe an object allocation as fast when the thread has a heap chunk cached and there's enough space for the object.
We describe an object allocation as slow when the thread needs to request a new heap chunk via the `mmap` system call to satisfy the allocation.
Slow allocations are the most interesting ones in this case,
because they give us a rough indication of which allocations are pushing the number of heap chunks up,
and therefore pushing the RSS consumption up.
In SubstrateVM, stacktraces of the slow allocation path contain invocations of methods defined in the `com.oracle.svm.core.genscavenge.ThreadLocalAllocation` class whose names carry the `slowPathNew` prefix,
e.g. `slowPathNewInstance` for plain objects or `slowPathNewArray` for arrays.

The areas of interest in the flamegraphs are those stacktraces that contain the `ApplicationImpl.doStart()` method call,
because that's the method responsible for starting up Quarkus applications.
We want to find out how many slow path allocations are visible in these stacktraces,
which are executed by the `main` thread.
Also, by looking at the stacktraces that cause these slow path allocations,
we can get an idea of which components might be allocation heavy.
If you focus the flamegraph on that method,
and then click on `Search` at the top right and type `slowPathNew`,
you can observe which of the `sys_enter_mmap` system calls are related to native heap memory allocation.
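
If you prefer counting these outside the SVG, the folded stacks produced by `stackcollapse-perf.pl` can be grepped directly. A small sketch, assuming the folded output is saved to `out.stacks.folded`:

[source,bash]
----
# Collapse the perf stacks into one line per unique stacktrace.
$ /opt/FlameGraph/stackcollapse-perf.pl < out.stacks > out.stacks.folded

# Count the unique stacktraces that go through ApplicationImpl.doStart()
# and reach a slowPathNew allocation. Each matching line is one unique
# stacktrace; the trailing number on the line is how often it was sampled.
$ grep ApplicationImpl.doStart out.stacks.folded | grep -c slowPathNew
----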

For Quarkus 2.7.6.Final, the flamegraph looks like this:

image::rss-before-mmap.svg[startup RSS flamegraph for Quarkus 2.7.6.Final]

And here's a screenshot highlighting the stacktraces that contain slow path allocations:

image::rss-before-slowPathNew.png[slow path new allocations for Quarkus 2.7.6.Final]

Observing the number of stacktraces that contain `slowPathNew` invocations above, i.e. 2, we can say that the `main` thread in Quarkus 2.7.6.Final allocates roughly 2MB of heap chunks.

For Quarkus 2.13.4.Final, the flamegraph looks like this:

image::rss-after-mmap.svg[startup RSS flamegraph for Quarkus 2.13.4.Final]

Focusing on the same area, we observe that the `slowPathNew` stacktrace count is 5 in this case, so the `main` thread allocates roughly 5MB of heap chunks:

image::rss-after-slowPathNew.png[slow path new allocations for Quarkus 2.13.4.Final]

The increase in object allocation at startup between these two Quarkus versions,
combined with the idiosyncrasies of how thread-local allocation works in SubstrateVM,
is what was causing the increase in RSS consumption.
In other words, even though both HotSpot and SubstrateVM experience the higher allocations,
SubstrateVM's fixed thread-local heap chunk size makes this more apparent,
compared to the dynamic thread-local allocation logic in HotSpot.
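
As a rough sanity check, the three extra heap chunks observed for the `main` thread account for about 3MB at the default 1MB chunk size,
which is in the same ballpark as the measured RSS difference of roughly 3.5MB (36608KB vs 33024KB).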

Since then, improvements have been made to the components that caused the extra object allocations.
We were able to identify the changes that had caused these extra allocations
by looking at the components that appeared in the stacktraces leading to slow allocations,
and inspecting the updates that had happened in those components.
In this case, updates to the SmallRye configuration handling at startup caused the regression.

[NOTE]
====
The stacktraces that cause slow object allocations give us an approximation of the components that cause the memory usage increase.
There could be situations where the stacktrace leading to a slow allocation is perfectly valid,
but due to the non-deterministic nature of object allocation,
it just happens to be the one that caused a new heap chunk to be allocated.
However, if you see multiple stacktraces in the same area,
that's probably a good hint that something in that area needs to be investigated.
====

By Quarkus version 2.13.7.Final, the RSS consumption on startup was back to previous levels.
The RSS regression was introduced by the changes in
https://github.com/quarkusio/quarkus/pull/26802#issue-1308798216[PR #26802],
and the fixes in https://github.com/quarkusio/quarkus/pull/29408[PR #29408]
and https://github.com/quarkusio/quarkus/pull/29842[PR #29842]
brought it back down.

== Performing the analysis with the `ObjectAllocationInNewTLAB` JFR event

As newer versions of GraalVM implement https://github.com/oracle/graal/issues/5410[more and more JFR events] in native mode,
it also becomes easier to analyse Quarkus application performance.
For instance, the recent release of GraalVM for JDK 17/20 adds support for the `jdk.ObjectAllocationInNewTLAB` event,
which would have highlighted the discrepancies between the Quarkus versions compared above.
Unfortunately, the Quarkus versions tested here are not compatible with this GraalVM version,
but we can give it a go with the latest Quarkus version to see what comes out.

First, we build the native Quarkus application with JFR support:

[source,bash]
----
$ ./mvnw package -DskipTests -Dquarkus.native.monitoring=jfr -Dnative
----

To obtain `jdk.ObjectAllocationInNewTLAB` events,
a custom JFC configuration file is required.
We use `jfr configure` to generate it:

[source,bash]
----
$ $JAVA_HOME/bin/jfr configure jdk.ObjectAllocationInNewTLAB#enabled=true --output newtlab.jfc
----

Then, we start the Quarkus native executable with the necessary JFR flags:

[source,bash]
----
$ getting-started/target/getting-started-1.0.0-SNAPSHOT-runner -XX:+FlightRecorder -XX:StartFlightRecording=settings=newtlab.jfc,filename=recording.jfr -Xmx64m
...
2023-07-21 12:25:33,739 INFO [io.quarkus] (main) getting-started 1.0.0-SNAPSHOT native (powered by Quarkus 3.2.1.Final) started in 0.019s. Listening on: http://0.0.0.0:8080
----
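
Before opening the recording in a GUI, the JDK's `jfr` tool can give a quick first impression of what was captured, for example:

[source,bash]
----
# High-level overview of the events contained in the recording.
$ $JAVA_HOME/bin/jfr summary recording.jfr

# Print the new-TLAB allocation events, including their stacktraces.
$ $JAVA_HOME/bin/jfr print --events jdk.ObjectAllocationInNewTLAB recording.jfr
----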

After shutting down Quarkus,
we can use https://adoptium.net/jmc[Adoptium's Eclipse Mission Control]
to visualize a flamegraph for all the `jdk.ObjectAllocationInNewTLAB` events.
We can also apply a similar focus on the `ApplicationImpl.doStart()` method to observe which slow path allocations get triggered from there:

image::jmc-new-tlab-event-from-doStart.png[new TLAB events from the ApplicationImpl.doStart() method in Quarkus 3.2.1.Final]

Looking at the flamegraph,
we can count 7 stacktraces for slow path allocations in the `main` thread.
Whether this is a problem,
or whether code execution has simply shifted from another thread to `main`,
is something we would need to explore further to see if there's a regression.
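
A rough command-line equivalent of that focus is to filter the printed events for the method of interest. This is only a sketch, since it simply matches the frame name in the textual output:

[source,bash]
----
# Count how many ObjectAllocationInNewTLAB events have ApplicationImpl.doStart()
# somewhere in their stacktrace; a rough proxy for the slow path allocations
# triggered from that method.
$ $JAVA_HOME/bin/jfr print --events jdk.ObjectAllocationInNewTLAB recording.jfr \
    | grep -c 'ApplicationImpl.doStart'
----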

If you encounter memory issues in your native applications,
or any other type of performance issue,
do not hesitate to give JFR a go.
Robert Toyonaga has written a couple of blog posts on the latest JFR additions;
see link:/blog/profile-and-monitor-native-executables-with-jfr[here] to find out more.