|
# Debugging performance — “What part of the hardware is slow?”

-[Graviton Performance Runbook toplevel](./graviton_perfrunbook.md)
+[Graviton Performance Runbook toplevel](./README.md)

Sometimes hardware, not code, is the reason for worse-than-expected performance. This may show up in on-cpu profiles as every function being slightly slower on Graviton because more CPU time is consumed, but with no obvious hot-spot function. If this is the case, measuring how the hardware performs can offer insight. Doing so requires counting special events in the CPU to understand which component of the CPU is the bottleneck preventing the code from executing as fast as possible.

@@ -91,21 +91,30 @@ To measure the standard CPU PMU events, do the following: |
%> cd ~/aws-graviton-getting-started/perfrunbook/utilities
# AMD (5a, 6a, and 7a) instances not supported currently.
%> sudo python3 ./measure_aggregated_pmu_stats.py --timeout 300
-|Ratio | geomean| p10| p50| p90| p95| p99| p99.9| p100|
-|ipc | 1.00| 0.84| 1.00| 1.13| 1.32| 2.46| 2.48| 2.48|
-|branch-mpki | 2.43| 1.67| 2.64| 4.74| 5.69| 7.23| 8.45| 8.45|
-|code_sparsity | 0.00| 0.00| 0.01| 0.02| 0.04| 0.10| 0.10| 0.10|
-|data-l1-mpki | 11.60| 10.29| 11.67| 14.99| 15.76| 16.94| 19.68| 19.68|
-|inst-l1-mpki | 13.23| 11.14| 13.47| 20.56| 25.50| 34.01| 35.12| 35.12|
-|l2-mpki | 7.00| 5.70| 6.62| 11.02| 13.74| 18.99| 24.56| 24.56|
-|l3-mpki | 1.64| 1.23| 1.47| 3.09| 3.60| 11.90| 14.61| 14.61|
-|core-rdBw-MBs | 0.00| 0.00| 0.00| 0.02| 0.04| 0.15| 1.17| 1.50|
-|stall_frontend_pkc | 384.50| 326.27| 404.82| 451.50| 475.00| 571.48| 571.98| 571.98|
-|stall_backend_pkc | 265.24| 230.51| 266.60| 335.77| 350.70| 384.22| 395.24| 395.24|
-|inst-tlb-mpki | 0.36| 0.23| 0.40| 0.65| 0.74| 1.69| 1.75| 1.75|
-|inst-tlb-tw-pki | 0.22| 0.14| 0.25| 0.43| 0.45| 0.53| 0.70| 0.70|
-|data-tlb-mpki | 2.18| 1.74| 2.01| 3.60| 4.54| 6.12| 6.19| 6.19|
-|data-tlb-tw-pki | 1.36| 1.10| 1.48| 1.82| 2.06| 3.01| 4.71| 4.71|
+|Ratio | geomean| p10| p50| p90| p95| p99| p99.9| p100|
+|ipc | 1.81| 1.72| 1.81| 1.90| 1.93| 1.95| 1.95| 1.95|
+|branch-mpki | 0.01| 0.01| 0.01| 0.01| 0.01| 0.02| 0.02| 0.02|
+|code_sparsity | 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|
+|data-l1-mpki | 11.24| 10.48| 11.25| 12.05| 12.21| 12.62| 12.62| 12.62|
+|inst-l1-mpki | 0.08| 0.07| 0.08| 0.09| 0.10| 0.15| 0.15| 0.15|
+|l2-ifetch-mpki | 0.06| 0.05| 0.06| 0.06| 0.07| 0.12| 0.12| 0.12|
+|l2-mpki | 0.71| 0.66| 0.70| 0.76| 0.77| 1.03| 1.03| 1.03|
+|l3-mpki | 0.49| 0.42| 0.49| 0.55| 0.63| 0.67| 0.67| 0.67|
+|stall_frontend_pkc | 1.97| 1.61| 1.90| 2.49| 2.68| 5.28| 5.28| 5.28|
+|stall_backend_pkc | 425.00| 414.64| 424.81| 433.63| 435.82| 441.57| 441.57| 441.57|
+|inst-tlb-mpki | 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|
+|inst-tlb-tw-pki | 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|
+|data-tlb-mpki | 1.49| 1.23| 1.55| 1.65| 1.66| 1.78| 1.78| 1.78|
+|data-tlb-tw-pki | 0.00| 0.00| 0.00| 0.01| 0.01| 0.01| 0.01| 0.01|
+|inst-neon-pkc | 0.31| 0.30| 0.30| 0.31| 0.31| 0.40| 0.40| 0.40|
+|inst-scalar-fp-pkc | 2.43| 2.37| 2.44| 2.49| 2.51| 2.52| 2.52| 2.52|
+|stall_backend_mem_pkc | 90.73| 83.97| 90.71| 97.67| 97.98| 100.98| 100.98| 100.98|
+|inst-sve-pkc | 419.00| 409.92| 419.83| 426.79| 430.73| 433.08| 433.08| 433.08|
+|inst-sve-empty-pkc | 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00|
+|inst-sve-full-pkc | 180.89| 176.99| 181.24| 184.31| 185.91| 187.16| 187.16| 187.16|
+|inst-sve-partial-pkc | 2.39| 2.27| 2.38| 2.50| 2.53| 2.58| 2.58| 2.58|
+|flop-sve-pkc | 1809.47| 1768.84| 1813.91| 1842.77| 1860.45| 1871.86| 1871.86| 1871.86|
+|flop-nonsve-pkc | 2.48| 2.41| 2.48| 2.54| 2.56| 2.57| 2.57| 2.57|
```

## Top-down method to debug hardware performance
@@ -138,6 +147,19 @@ Backend stalls are caused when the CPU is unable to make forward progress execut |
6. If back-end stalls due to the cache-system and memory system are the problem, the data-set size and layout need to be optimized.
7. Proceed to [Section 6](./optimization_recommendation.md) to view optimization recommendations for working with a large data-set causing backend stalls.

+### Drill down Vectorization
+
+Vectorization is accomplished by either SVE or NEON instructions. SVE vectorization uses 256-bit vectors on Graviton3 processors, but the scalable nature of SVE makes both the code and the binary vector-length agnostic. NEON vectorization always uses a 128-bit vector size and does not have SVE's predication feature.
+
+For SVE instructions there are metrics which describe how many SVE instructions had empty, full, and partially-filled SVE predicates: `inst-sve-empty-pkc`, `inst-sve-partial-pkc`, and `inst-sve-full-pkc`. These metrics apply to all SVE instructions (loads, stores, integer, and floating-point operations). The `pkc` suffix indicates the counters are in units of "per kilo cycle".
+
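+As a minimal sketch of how full and partial predicates arise (illustrative only; it assumes a compiler that provides the SVE ACLE intrinsics in `arm_sve.h`, e.g. `gcc -O2 -march=armv8-a+sve`), consider a strip-mined loop over 32-bit floats:
+
+```c
+#include <arm_sve.h>
+#include <stddef.h>
+#include <stdint.h>
+
+// Scale n 32-bit floats in place. On Graviton3 (256-bit SVE vectors) each
+// full iteration processes 8 floats with an all-true predicate; when n is
+// not a multiple of 8, the final iteration runs with a partially-filled
+// predicate, which is the case inst-sve-partial-pkc counts.
+void scale(float *x, float s, size_t n) {
+    for (size_t i = 0; i < n; i += svcntw()) {
+        svbool_t pg = svwhilelt_b32_u64((uint64_t)i, (uint64_t)n);
+        svfloat32_t v = svld1_f32(pg, &x[i]);
+        v = svmul_n_f32_x(pg, v, s);   // predicated multiply by scalar s
+        svst1_f32(pg, &x[i], v);
+    }
+}
+```
+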
+A single SVE instruction can execute multiple (vectorized) floating-point operations in the ALU. These are counted individually by `flop-sve-pkc`. For example, a single SVE `FMUL` instruction on 32-bit floats on Graviton3's 256-bit vector will increment the `flop-sve-pkc` counter by eight, because the operation is executed on the eight 32-bit floats that fit in the 256-bit vector. Some instructions, such as `FMA` (Fused Multiply Add), execute two floating-point operations per element in the vector and increment the counter accordingly. The `flop-sve-pkc` counter is incremented assuming a full SVE predicate.
+
+Floating-point operations for NEON and scalar instructions are counted together in the `flop-nonsve-pkc` counter. For a single NEON `FMUL` instruction on 32-bit floats, the `inst-neon-pkc` counter will increment by one, and the `flop-nonsve-pkc` counter will increment by four (the number of 32-bit floats in a 128-bit NEON register). For a single scalar `FMUL` instruction, the `flop-nonsve-pkc` counter will increment by one. Some instructions (e.g., Fused Multiply Add) will increment the value by two.
+
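+As a sketch of how these counters map back to source code, consider the loop below. The per-instruction numbers in the comments assume a full predicate, and which counters move depends on how the compiler chooses to vectorize (SVE, NEON, or scalar), so treat them as illustrative rather than guaranteed code generation:
+
+```c
+#include <stddef.h>
+
+// Multiply-accumulate over 32-bit floats: a[i] = a[i] * b[i] + c.
+// If the compiler emits SVE FMLA on Graviton3 (256-bit vectors), each such
+// instruction counts once toward inst-sve-pkc and 16 toward flop-sve-pkc
+// (8 lanes x 2 operations for the fused multiply-add).
+// If it emits NEON FMLA (128-bit vectors), each instruction counts once
+// toward inst-neon-pkc and 8 toward flop-nonsve-pkc (4 lanes x 2).
+// If it stays scalar, each FMADD contributes 2 toward flop-nonsve-pkc.
+void fma_loop(float *a, const float *b, float c, size_t n) {
+    for (size_t i = 0; i < n; i++) {
+        a[i] = a[i] * b[i] + c;
+    }
+}
+```
+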
+The total number of floating-point instructions retired every 1000 cycles is `inst-scalar-fp-pkc + (inst-neon-pkc + inst-sve-pkc)*<alpha>`, where `<alpha>` is a user-provided estimate of the fraction of SVE and NEON instructions that task the floating-point ALU rather than performing loads, stores, or integer operations. The total number of floating-point operations executed by those instructions is `flop-nonsve-pkc + flop-sve-pkc*<beta>`, where `<beta>` is a user-provided estimate of how often the predicate was full. To calculate a maximum expected value for these metrics, consult the ARM Software Optimization Guide and derive a FLOP-per-kilocycle figure from the instruction throughput and the element size of the operation. Code with no loads, stores, or dependencies, performing `FMUL` on 32-bit floats entirely in L1 cache, could theoretically observe `flop-sve-pkc` of 16,000 with SVE, `flop-nonsve-pkc` of 16,000 with NEON SIMD, or `flop-nonsve-pkc` of 4,000 with scalar operations.
+
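+For example, using the geomean column from the table above and assuming `<alpha> = 0.9` and `<beta> = 1.0` (illustrative values only; both must be estimated for the workload actually being measured), the workload retires roughly `2.43 + (0.31 + 419.00)*0.9 ≈ 380` floating-point instructions per kilocycle, which execute roughly `2.48 + 1809.47*1.0 ≈ 1812` floating-point operations per kilocycle.
+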
+A footnote for readers of the ARM architecture PMU event descriptions: SVE floating-point operations are reported by hardware in units of "floating-point operations per 128 bits of vector size"; however, the aggregation script we provide has already accounted for the Graviton3 vector width before reporting.

## Additional PMUs and PMU events

|
|