Commit 86689bc

Merge pull request aws#347 from lrbison/fp_pmu

Floating Point PMUs for Performance Runbook

2 parents a920246 + 303d65e

14 files changed (+80 −36 lines)

README.md
Lines changed: 2 additions & 2 deletions

```diff
@@ -29,7 +29,7 @@ This repository provides technical guidance for users and developers using [Amaz
 * [Spark on Graviton](DataAnalytics.md)
 * [Known issues and workarounds](#known-issues-and-workarounds)
 * [AWS Managed Services available on Graviton](managed_services.md)
-* [Graviton Performance Runbook](perfrunbook/graviton_perfrunbook.md)
+* [Graviton Performance Runbook](perfrunbook/README.md)
 * [Assembly Optimization Guide for Graviton Arm64 Processors](arm64-assembly-optimization.md)
 * [Additional resources](#additional-resources)
 * [How To Resources](howtoresources.md)
@@ -62,7 +62,7 @@ If you are new to Graviton and want to understand how to identify target workloa
 |DDR Encryption |yes |yes |
 
 # Optimizing for Graviton
-Please refer to [optimizing](optimizing.md) for general debugging and profiling information. For detailed checklists on optimizing and debugging performance on Graviton, see our [performance runbook](perfrunbook/graviton_perfrunbook.md).
+Please refer to [optimizing](optimizing.md) for general debugging and profiling information. For detailed checklists on optimizing and debugging performance on Graviton, see our [performance runbook](perfrunbook/README.md).
 
 Different architectures and systems have differing capabilities, which means some tools you might be familiar with on one architecture don't have an equivalent on AWS Graviton. We have documented [Monitoring Tools](Monitoring_Tools_on_Graviton.md) with some of these utilities.
```
File renamed without changes.

perfrunbook/appendix.md
Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 # Appendix:
 
-[Graviton Performance Runbook toplevel](./graviton_perfrunbook.md)
+[Graviton Performance Runbook toplevel](./README.md)
 
 This Appendix contains additional information for engineers that want to go deeper on a particular topic, such as using different PMU counters to understand how the code is executing on the hardware, discussion on load generators, and additional tools to help with code observability.
 
```

perfrunbook/configuring_your_loadgen.md
Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 # Configuring your load generator
 
-[Graviton Performance Runbook toplevel](./graviton_perfrunbook.md)
+[Graviton Performance Runbook toplevel](./README.md)
 
 It is important to understand and verify that the load-generator setup generates the load you expect. An unverified load-generation setup can lead to measuring something other than the intended experiment and getting results that are hard to interpret. Below is a checklist to step through to verify the load generator is working as expected.
 
```

perfrunbook/configuring_your_sut.md
Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 # Configuring your system-under-test environment
 
-[Graviton Performance Runbook toplevel](./graviton_perfrunbook.md)
+[Graviton Performance Runbook toplevel](./README.md)
 
 This section documents multiple checklists to use to verify your Graviton System-under-test (SUT) is up-to-date and as code-equivalent as possible to the systems and instances you are comparing against. Please perform these tests on each SUT to vet your experimental setup and eliminate as many potential unknown variables as possible.
 
```

perfrunbook/debug_code_perf.md
Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 # Debugging performance — “What part of the code is slow?”
 
-[Graviton Performance Runbook toplevel](./graviton_perfrunbook.md)
+[Graviton Performance Runbook toplevel](./README.md)
 
 If after checking the system behavior with the sysstat tools the behavior of your code on the CPU is still different, then your next step is to generate code profiles. There are two primary types of profiles
 
```

perfrunbook/debug_hw_perf.md
Lines changed: 38 additions & 16 deletions

````diff
@@ -1,6 +1,6 @@
 # Debugging performance — “What part of the hardware is slow?”
 
-[Graviton Performance Runbook toplevel](./graviton_perfrunbook.md)
+[Graviton Performance Runbook toplevel](./README.md)
 
 Sometimes, hardware, not code, is the reason for worse than expected performance. This may show up in the on-cpu profiles as every function is slightly slower on Graviton as more CPU time is consumed, but no obvious hot-spot function exists. If this is the case, then measuring how the hardware performs can offer insight. To do this requires counting special events in the CPU to understand which component of the CPU is bottlenecking the code from executing as fast as possible.
 
@@ -91,21 +91,30 @@ To measure the standard CPU PMU events, do the following:
 %> cd ~/aws-graviton-getting-started/perfrunbook/utilities
 # AMD (5a, 6a, and 7a) instances not supported currently.
 %> sudo python3 ./measure_aggregated_pmu_stats.py --timeout 300
-|Ratio | geomean| p10| p50| p90| p95| p99| p99.9| p100|
-|ipc | 1.00| 0.84| 1.00| 1.13| 1.32| 2.46| 2.48| 2.48|
-|branch-mpki | 2.43| 1.67| 2.64| 4.74| 5.69| 7.23| 8.45| 8.45|
-|code_sparsity | 0.00| 0.00| 0.01| 0.02| 0.04| 0.10| 0.10| 0.10|
-|data-l1-mpki | 11.60| 10.29| 11.67| 14.99| 15.76| 16.94| 19.68| 19.68|
-|inst-l1-mpki | 13.23| 11.14| 13.47| 20.56| 25.50| 34.01| 35.12| 35.12|
-|l2-mpki | 7.00| 5.70| 6.62| 11.02| 13.74| 18.99| 24.56| 24.56|
-|l3-mpki | 1.64| 1.23| 1.47| 3.09| 3.60| 11.90| 14.61| 14.61|
-|core-rdBw-MBs | 0.00| 0.00| 0.00| 0.02| 0.04| 0.15| 1.17| 1.50|
-|stall_frontend_pkc | 384.50| 326.27| 404.82| 451.50| 475.00| 571.48| 571.98| 571.98|
-|stall_backend_pkc | 265.24| 230.51| 266.60| 335.77| 350.70| 384.22| 395.24| 395.24|
-|inst-tlb-mpki | 0.36| 0.23| 0.40| 0.65| 0.74| 1.69| 1.75| 1.75|
-|inst-tlb-tw-pki | 0.22| 0.14| 0.25| 0.43| 0.45| 0.53| 0.70| 0.70|
-|data-tlb-mpki | 2.18| 1.74| 2.01| 3.60| 4.54| 6.12| 6.19| 6.19|
-|data-tlb-tw-pki | 1.36| 1.10| 1.48| 1.82| 2.06| 3.01| 4.71| 4.71|
+|Ratio | geomean| p10| p50| p90| p95| p99| p99.9| p100|
+|ipc | 1.81| 1.72| 1.81| 1.90| 1.93| 1.95| 1.95| 1.95|
+|branch-mpki | 0.01| 0.01| 0.01| 0.01| 0.01| 0.02| 0.02| 0.02|
+|code_sparsity | 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|
+|data-l1-mpki | 11.24| 10.48| 11.25| 12.05| 12.21| 12.62| 12.62| 12.62|
+|inst-l1-mpki | 0.08| 0.07| 0.08| 0.09| 0.10| 0.15| 0.15| 0.15|
+|l2-ifetch-mpki | 0.06| 0.05| 0.06| 0.06| 0.07| 0.12| 0.12| 0.12|
+|l2-mpki | 0.71| 0.66| 0.70| 0.76| 0.77| 1.03| 1.03| 1.03|
+|l3-mpki | 0.49| 0.42| 0.49| 0.55| 0.63| 0.67| 0.67| 0.67|
+|stall_frontend_pkc | 1.97| 1.61| 1.90| 2.49| 2.68| 5.28| 5.28| 5.28|
+|stall_backend_pkc | 425.00| 414.64| 424.81| 433.63| 435.82| 441.57| 441.57| 441.57|
+|inst-tlb-mpki | 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|
+|inst-tlb-tw-pki | 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|
+|data-tlb-mpki | 1.49| 1.23| 1.55| 1.65| 1.66| 1.78| 1.78| 1.78|
+|data-tlb-tw-pki | 0.00| 0.00| 0.00| 0.01| 0.01| 0.01| 0.01| 0.01|
+|inst-neon-pkc | 0.31| 0.30| 0.30| 0.31| 0.31| 0.40| 0.40| 0.40|
+|inst-scalar-fp-pkc | 2.43| 2.37| 2.44| 2.49| 2.51| 2.52| 2.52| 2.52|
+|stall_backend_mem_pkc| 90.73| 83.97| 90.71| 97.67| 97.98| 100.98| 100.98| 100.98|
+|inst-sve-pkc | 419.00| 409.92| 419.83| 426.79| 430.73| 433.08| 433.08| 433.08|
+|inst-sve-empty-pkc | 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|
+|inst-sve-full-pkc | 180.89| 176.99| 181.24| 184.31| 185.91| 187.16| 187.16| 187.16|
+|inst-sve-partial-pkc| 2.39| 2.27| 2.38| 2.50| 2.53| 2.58| 2.58| 2.58|
+|flop-sve-pkc | 1809.47| 1768.84| 1813.91| 1842.77| 1860.45| 1871.86| 1871.86| 1871.86|
+|flop-nonsve-pkc | 2.48| 2.41| 2.48| 2.54| 2.56| 2.57| 2.57| 2.57|
 ```
 
 ## Top-down method to debug hardware performance
@@ -138,6 +147,19 @@ Backend stalls are caused when the CPU is unable to make forward progress execut
 6. If back-end stalls due to the cache-system and memory system are the problem, the data-set size and layout needs to be optimized.
 7. Proceed to [Section 6](./optimization_recommendation.md) to view optimization recommendations for working with a large data-set causing backend stalls.
 
+### Drill down Vectorization
+
+Vectorization is accomplished either by SVE or NEON instructions. SVE vectorization will use 256-bit vectors on Graviton 3 processors, but the scalable nature of SVE makes both the code and binary vector-length agnostic. NEON vectorization always uses a 128-bit vector size and does not have the predicate feature of SVE.
+
+For SVE instructions there are metrics which describe how many SVE instructions had empty, full, and partially-filled SVE predicates: `inst-sve-empty-pkc`, `inst-sve-partial-pkc`, and `inst-sve-full-pkc`. These metrics apply to all SVE instructions (loads, stores, integer, and floating-point operations). The `pkc` suffix indicates the counters are in units of "per kilo cycle".
+
+A single SVE instruction can execute multiple (vectorized) floating-point operations in the ALU. These are counted individually by `flop-sve-pkc`. For example, a single SVE `FMUL` instruction on 32-bit floats on Graviton 3's 256-bit vector will increment the `flop-sve-pkc` counter by eight, because the operation is executed on the eight 32-bit floats that fit in the 256-bit vector. Some instructions, such as `FMA` (Fused Multiply Add), execute two floating-point operations per element in the vector and increment the counter accordingly. The `flop-sve-pkc` counter is incremented assuming a full SVE predicate.
+
+Floating-point operations for NEON and scalar instructions are counted together in the `flop-nonsve-pkc` counter. For a single NEON `FMUL` instruction on 32-bit floats, the `inst-neon-pkc` counter will increment by one, and the `flop-nonsve-pkc` counter will increment by four (the number of 32-bit floats in a 128-bit NEON register). For a single scalar `FMUL` instruction, the `flop-nonsve-pkc` counter will increment by one. Some instructions (e.g., Fused Multiply Add) will increment the value by two.
+
+The total number of floating-point instructions retired every 1000 cycles is `inst-scalar-fp-pkc + (inst-neon-pkc + inst-sve-pkc)*<alpha>`, where `<alpha>` is a user-provided estimate of the fraction of SVE and NEON instructions which task the floating-point ALU versus loads, stores, or integer operations. The total number of floating-point operations executed by those instructions is `flop-nonsve-pkc + flop-sve-pkc*<beta>`, where `<beta>` is a user-provided estimate of how often the predicate was full. To calculate a maximum expected value for these metrics, consult the ARM Software Optimization Guide and determine a FLOP-per-kilocycle value from the instruction throughput and element size of the operation. Code with no loads, stores, or dependencies performing `FMUL` entirely in L1 cache on 32-bit floats could theoretically observe a `flop-sve-pkc` of 16,000 with SVE, a `flop-nonsve-pkc` of 16,000 with NEON SIMD, or a `flop-nonsve-pkc` of 4,000 with scalar operations.
+
+A footnote for readers of the ARM architecture PMU event descriptions: SVE floating-point operations are reported by hardware in units of "floating point operations per 128 bits of vector size"; however, the aggregation script we provide has already accounted for the Graviton 3 vector width before reporting.
 
 ## Additional PMUs and PMU events
 
````
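The FLOP accounting introduced in this diff can be sketched in a few lines of Python. The counter names match the rows printed by `measure_aggregated_pmu_stats.py`; the `alpha` and `beta` values and the `fp_estimates` helper are illustrative assumptions for this sketch, not part of the script:

```python
# Back-of-the-envelope floating-point accounting from the PMU ratios
# printed by measure_aggregated_pmu_stats.py, following the formulas in
# the added "Drill down Vectorization" section. alpha and beta are
# user-supplied estimates; the hardware does not report them.

def fp_estimates(counters, alpha, beta):
    """Return (fp_instructions_pkc, flops_pkc).

    alpha: estimated fraction of NEON/SVE instructions that are
           floating-point operations (vs. loads, stores, integer ops).
    beta:  estimated average fullness of SVE predicates, since
           flop-sve-pkc is counted assuming a full predicate.
    """
    # FP instructions retired per kilocycle.
    fp_inst_pkc = (counters["inst-scalar-fp-pkc"]
                   + (counters["inst-neon-pkc"] + counters["inst-sve-pkc"]) * alpha)
    # FP operations executed per kilocycle.
    flop_pkc = counters["flop-nonsve-pkc"] + counters["flop-sve-pkc"] * beta
    return fp_inst_pkc, flop_pkc

# Geomean column from the sample run in the diff above; alpha/beta are
# hypothetical guesses for a mostly-FP, mostly-full-predicate workload.
geomean = {
    "inst-scalar-fp-pkc": 2.43,
    "inst-neon-pkc": 0.31,
    "inst-sve-pkc": 419.00,
    "flop-nonsve-pkc": 2.48,
    "flop-sve-pkc": 1809.47,
}
inst_pkc, flop_pkc = fp_estimates(geomean, alpha=0.9, beta=0.95)
```

Comparing the resulting `flop_pkc` against a theoretical ceiling (e.g., the 16,000 figure the text derives for 32-bit SVE `FMUL` on Graviton 3) then gives a rough floating-point efficiency estimate for the workload.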

perfrunbook/debug_system_perf.md
Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 # Debugging performance — “What part of the system is slow?”
 
-[Graviton Performance Runbook toplevel](./graviton_perfrunbook.md)
+[Graviton Performance Runbook toplevel](./README.md)
 
 When debugging performance, start by measuring high level system behavior to pinpoint what part of the system performs differently when compared with a control instance. Are the CPUs being saturated or under-saturated? Is the network or disk behaving differently than expected? Did a mis-configuration creep in that went undetected when validating the SUT application setup?
 
```

perfrunbook/defining_your_benchmark.md
Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 # Defining your benchmark
 
-[Graviton Performance Runbook toplevel](./graviton_perfrunbook.md)
+[Graviton Performance Runbook toplevel](./README.md)
 
 To define a benchmark there are two things to consider, the software running on the System-under-test (SUT) and how to drive load. We recommend the software running on the SUT should be your production application. There is no better benchmark to predict performance than the actual production code. If a synthetic proxy must be used to break dependencies of your application on external services such as authentication layers, then that proxy should be derived from the production code as much as possible. We recommend avoiding synthetic benchmarks not related to the production code. They are generally poor at predicting performance for another application or helping optimize it as they can over-target specific attributes of a system or exercise different bottlenecks than your application code might.
 
```

perfrunbook/intro_to_benchmarking.md
Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 # Quick introduction to benchmarking
 
-[Graviton Performance Runbook toplevel](./graviton_perfrunbook.md)
+[Graviton Performance Runbook toplevel](./README.md)
 
 When designing an experiment to benchmark Graviton2 against another instance type, it is key to remember the two guiding principles below:
 
```
