---
layout: post
title: 'Analysing Quarkus Native startup RSS consumption'
date: 2023-07-27
tags: native rss jfr
synopsis: 'Use Linux native tooling and JFR to understand Quarkus Native startup RSS consumption'
author: galderz
---
:imagesdir: /assets/images/posts/native-startup-rss-troubleshooting

During the development of Quarkus 2.13,
we discovered a startup degradation in native mode.
One of the key aspects of this degradation was that RSS consumption on startup had gone up by about 10-15% compared to Quarkus 2.7.
In this blog post you will learn how the Quarkus native RSS consumption debugging techniques described in
https://quarkus.io/guides/native-reference#rss[the Quarkus native reference guide]
were used to diagnose this issue.
You will also learn about recent updates in GraalVM that make this kind of analysis more convenient.

Our analysis showed that the RSS consumption for Quarkus 2.7 was about 33MB while for Quarkus 2.13 it was about 36MB, roughly a 10% increase.
Below we see how we measured the RSS consumption for Quarkus 2.7.6.Final
(note that `ps` reports RSS in KiB):

[source,bash]
----
$ getting-started/target/getting-started-1.0.0-SNAPSHOT-runner -Xmx64m
...
2023-07-21 10:13:11,304 INFO [io.quarkus] (main) getting-started 1.0.0-SNAPSHOT native (powered by Quarkus 2.7.6.Final) started in 0.023s. Listening on: http://0.0.0.0:8080

$ ps -p $(pidof getting-started-1.0.0-SNAPSHOT-runner) -o rss=
33024
----

Similarly, below we see how we measured the RSS consumption for Quarkus 2.13.4.Final,
which in our environment had gone up to about 36MB:

[source,bash]
----
$ getting-started/target/getting-started-1.0.0-SNAPSHOT-runner -Xmx64m
...
2023-07-21 10:24:38,529 INFO [io.quarkus] (main) getting-started 1.0.0-SNAPSHOT native (powered by Quarkus 2.13.4.Final) started in 0.027s. Listening on: http://0.0.0.0:8080

$ ps -p $(pidof getting-started-1.0.0-SNAPSHOT-runner) -o rss=
36608
----

Since Quarkus 2.7 and 2.13 used different GraalVM versions (21.3 vs 22.3),
we first investigated whether the increase in startup time and RSS consumption was due to a change in GraalVM.
This turned out not to be the case:
using the same GraalVM version on both Quarkus versions still resulted in similar differences.

We also profiled the startup time in JVM mode,
and although we could observe a slight increase in heap usage,
the RSS increase was not apparent at runtime.
So, the challenge was to understand what was causing an RSS increase that seemed to affect only native mode
and that was Quarkus specific.

In native mode we needed to look for system calls that increased RSS,
and investigate what caused them.
There are several system calls that can be used to allocate memory, e.g. `brk` or `mmap`,
but when we analysed the system calls emitted on startup,
we discovered that the `mmap` calls were the prominent ones.
Following
https://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html[Brendan Gregg’s Memory Leak (and Growth) Flame Graphs guide]
on a Linux environment,
we were able to produce flamegraphs for the `mmap` system calls.
To obtain these flamegraphs, we captured the `sys_enter_mmap` system call events with `perf record`,
and then generated the flamegraphs from the recorded data:

[source,bash]
----
$ sudo perf record -e syscalls:sys_enter_mmap --call-graph dwarf \
  -- getting-started/target/getting-started-1.0.0-SNAPSHOT-runner -Xmx64m
...
2023-07-21 10:13:11,304 INFO [io.quarkus] (main) getting-started 1.0.0-SNAPSHOT native (powered by Quarkus 2.7.6.Final) started in 0.023s. Listening on: http://0.0.0.0:8080

$ perf script -i perf.data > out.stacks

$ /opt/FlameGraph/stackcollapse-perf.pl < out.stacks \
  | /opt/FlameGraph/flamegraph.pl \
    --color=mem \
    --title="mmap Flame Graph" \
    --countname="calls" > out.svg
----

[NOTE]
====
The native executable above was created passing in
`-Dquarkus.native.debug.enabled` and `-Dquarkus.native.additional-build-args=-H:-DeleteLocalSymbols`
as extra build flags.
====
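
For reference, a build invocation along these lines would produce such an executable (a sketch, assuming the standard Maven setup of the Quarkus getting-started application):

[source,bash]
----
# Sketch: build the native executable with debug info and local symbols kept
$ ./mvnw package -DskipTests -Dnative \
    -Dquarkus.native.debug.enabled=true \
    -Dquarkus.native.additional-build-args=-H:-DeleteLocalSymbols
----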

`mmap` calls happen for a variety of reasons,
but in the context of this blog post,
we're interested in those `mmap` calls that are triggered by object allocations.
Not every object allocation triggers an `mmap` call.
Instead, each thread in SubstrateVM allocates a heap chunk using `mmap` when it allocates its first object,
and it keeps using this heap chunk as long as there's space in it for further objects allocated by the same thread.
When the heap chunk is full,
the thread requests a new heap chunk using `mmap`.
This pattern is called thread-local allocation.
HotSpot has a similar feature,
but there the heap chunk size is computed dynamically depending on multiple factors,
while on SubstrateVM the size is fixed.
At the time of writing, the default heap chunk size is 1MB,
but an upcoming change will make it 512KB instead.
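
If you are curious about the exact chunk size in your GraalVM version,
it can be looked up among the `native-image` expert options;
a quick check, assuming `native-image` is on the `PATH`
(`AlignedHeapChunkSize` is the SubstrateVM option backing the heap chunk size at the time of writing):

[source,bash]
----
# Sketch: list the heap chunk size expert option and its default value
$ native-image --expert-options-all 2>&1 | grep AlignedHeapChunkSize
----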

We describe an object allocation as fast when the thread has a heap chunk cached and there's enough space in it for the object.
We describe an object allocation as slow when the thread needs to request a new heap chunk via the `mmap` system call to satisfy the allocation.
Slow allocations are the most interesting ones in this case,
because they give us a rough indication of which allocations are pushing the number of heap chunks up,
and therefore pushing the RSS consumption up.
In SubstrateVM, stacktraces of the slow allocation path contain invocations to methods defined in the `com.oracle.svm.core.genscavenge.ThreadLocalAllocation` class whose names carry the `slowPathNew` prefix,
e.g. `slowPathNewInstance` for plain objects or `slowPathNewArray` for arrays.
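
Because the executable was built with `-H:-DeleteLocalSymbols`,
these methods should also be visible in its symbol table.
A quick sanity check along these lines, assuming the same binary as above:

[source,bash]
----
# Sketch: confirm the slow path allocation symbols were kept in the binary
$ nm getting-started/target/getting-started-1.0.0-SNAPSHOT-runner | grep -i slowpathnew
----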

The areas of interest in the flamegraphs are the stacktraces that contain the `ApplicationImpl.doStart()` method call,
because that's the method responsible for starting up Quarkus applications.
We want to find out how many slow path allocations are visible in these stacktraces,
which are executed by the `main` thread.
Also, by looking at the stacktraces that cause these slow path allocations,
we can get an idea of which components might be allocation heavy.
If you focus the flamegraph on that method,
and then click on `Search` at the top right and type `slowPathNew`,
you can observe which of the `sys_enter_mmap` system calls are related to native heap memory allocation.
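
The same count can also be obtained on the command line from the collapsed stacks;
a minimal sketch, assuming the `out.stacks` file and the FlameGraph scripts from the earlier step:

[source,bash]
----
# Sketch: count collapsed stacktraces that go through ApplicationImpl.doStart()
# and contain a slow path allocation
$ /opt/FlameGraph/stackcollapse-perf.pl < out.stacks \
  | grep doStart | grep -c slowPathNew
----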

For Quarkus 2.7.6.Final, the flamegraph looks like this:

image::rss-before-mmap.svg[startup RSS flamegraph for Quarkus 2.7.6.Final]

And here's a screenshot highlighting the stacktraces that contain slow path allocations:

image::rss-before-slowPathNew.png[slow path new allocations for Quarkus 2.7.6.Final]

The flamegraph above shows 2 stacktraces that contain `slowPathNew` invocations,
so, at 1MB per heap chunk, we can say that the `main` thread in Quarkus 2.7.6.Final allocates roughly 2MB of heap chunks.

For Quarkus 2.13.4.Final, the flamegraph looks like this:

image::rss-after-mmap.svg[startup RSS flamegraph for Quarkus 2.13.4.Final]

Focusing on the same area, we observe that the `slowPathNew` stacktrace count is 5 in this case, so the `main` thread allocates roughly 5MB of heap chunks:

image::rss-after-slowPathNew.png[slow path new allocations for Quarkus 2.13.4.Final]

The increase in object allocation at startup between these two Quarkus versions,
combined with the idiosyncrasies of how thread-local allocation works in SubstrateVM,
is what was causing the increase in RSS consumption.
In other words, even though both HotSpot and SubstrateVM experience the higher allocations,
SubstrateVM's fixed thread-local allocation heap chunk size makes this more apparent,
compared to the dynamic thread-local allocation logic in HotSpot.

Since then, improvements were made to the components that caused the extra object allocation.
We were able to detect the changes that had caused these extra allocations
by looking at the components that appeared in the stacktraces leading to slow allocations,
and inspecting updates that had happened in those components.
In this case, updates to the SmallRye configuration handling at startup caused the regression.

[NOTE]
====
The stacktraces that cause slow object allocations give us an approximation of which components are behind a memory usage increase.
There could be situations where the stacktrace leading to a slow allocation is perfectly valid,
but due to the non-deterministic nature of object allocation,
it just happens to be the one that caused a new heap chunk to be allocated.
However, if you see multiple such stacktraces in the same area,
that's probably a good hint that something in that area needs to be investigated.
====

By Quarkus version 2.13.7.Final, the RSS consumption on startup was back to previous levels.
The RSS regression was introduced by the changes in
https://github.com/quarkusio/quarkus/pull/26802#issue-1308798216[PR #26802],
and the fixes in https://github.com/quarkusio/quarkus/pull/29408[PR #29408]
and https://github.com/quarkusio/quarkus/pull/29842[PR #29842]
brought it back down.

== Performing the analysis with the `ObjectAllocationInNewTLAB` JFR event

As newer versions of GraalVM implement https://github.com/oracle/graal/issues/5410[more and more JFR events] in native mode,
it also becomes easier to analyse Quarkus application performance.
For instance, the recent release of GraalVM for JDK 17/20 adds support for the `jdk.ObjectAllocationInNewTLAB` event,
which would have highlighted the discrepancies between the Quarkus versions compared above.
Unfortunately, the Quarkus versions tested here are not compatible with that GraalVM version,
but we can give it a go with the latest Quarkus version to see what comes out.

First, we build the native Quarkus application with JFR support:

[source,bash]
----
$ ./mvnw package -DskipTests -Dquarkus.native.monitoring=jfr -Dnative
----

To obtain `jdk.ObjectAllocationInNewTLAB` events,
a custom JFC configuration file is required.
We use `jfr configure` to generate it:

[source,bash]
----
$ $JAVA_HOME/bin/jfr configure jdk.ObjectAllocationInNewTLAB#enabled=true --output newtlab.jfc
----

Then, we start the Quarkus native executable with the necessary JFR flags:

[source,bash]
----
$ getting-started/target/getting-started-1.0.0-SNAPSHOT-runner -XX:+FlightRecorder -XX:StartFlightRecording=settings=newtlab.jfc,filename=recording.jfr -Xmx64m
...
2023-07-21 12:25:33,739 INFO [io.quarkus] (main) getting-started 1.0.0-SNAPSHOT native (powered by Quarkus 3.2.1.Final) started in 0.019s. Listening on: http://0.0.0.0:8080
----

After shutting down Quarkus,
we can use https://adoptium.net/jmc[Adoptium's Eclipse Mission Control]
to visualize a flamegraph for all the `jdk.ObjectAllocationInNewTLAB` events.
We can also apply a similar focus on the `ApplicationImpl.doStart()` method to observe which slow path allocations get triggered from there:

image::jmc-new-tlab-event-from-doStart.png[new TLAB events from the ApplicationImpl.doStart() method in Quarkus 3.2.1.Final]

Looking at the flamegraph,
we can count 7 stacktraces with slow path allocations for the `main` thread.
Whether this is a problem,
or whether code execution has simply shifted from another thread to `main`,
is something we would need to explore further to see if there's a regression.
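
If JMC is not at hand, the raw events can also be inspected with the `jfr` tool bundled with the JDK;
for example, to print the TLAB allocation events captured in the recording above:

[source,bash]
----
# Print the jdk.ObjectAllocationInNewTLAB events from the recording;
# each event carries the allocating thread, object class and TLAB size
$ $JAVA_HOME/bin/jfr print --events jdk.ObjectAllocationInNewTLAB recording.jfr
----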

If you encounter memory issues in your native applications,
or any other type of performance issue,
do not hesitate to give JFR a go.
Robert Toyonaga has written a couple of blog posts on the latest JFR additions;
see link:/blog/profile-and-monitor-native-executables-with-jfr[here] to find out more.