---
title: "Performance Tip of the Week #7: Optimizing for application productivity"
layout: fast
sidenav: side-nav-fast.html
published: true
permalink: fast/7
type: markdown
order: "007"
---

Originally posted as Fast TotW #7 on June 6, 2019

*By [Chris Kennelly](mailto:[email protected])*

Updated 2023-09-14

Quicklink: [abseil.io/fast/7](https://abseil.io/fast/7)

## Overview

Google manages a vast fleet of servers to handle search queries, process
records, and transcode cat videos. We don't buy those servers to allocate
memory in TCMalloc, to put protocol buffers into other protocol buffers, or to
handle branch mispredictions by our processors.

To make our fleet more efficient, we want to optimize for how productive our
servers are, that is, how much useful work they accomplish per CPU-second, byte
of RAM, disk IOPS, or unit of hardware-accelerator time. While measuring a
job's resource consumption is easy, it's harder to tell just how much useful
work it's accomplishing without help.

A task's CPU usage going up could mean it suffered a performance regression, or
it could mean the task is simply busier. Consider a plot of a service's CPU
usage against time, with the total broken down by the two versions of the
binary. We cannot determine from casual inspection what caused the increase in
CPU usage: an increase in workload (serving more videos per unit time) or a
decrease in efficiency (some added, needless protocol conversion per video).

To determine what is really happening, we need a productivity metric that
captures the amount of real work completed. If we know the number of cat videos
processed, we can easily determine whether we are getting more (or less) real
work done per CPU-second (or byte of RAM, disk operation, or unit of hardware
accelerator time).
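
For concreteness, here is a minimal sketch of what exporting such a metric
could look like; the counter, the work unit, and the function names are
hypothetical, not from the original post. The idea is simply to pair a count of
completed work items with the CPU time consumed:

```cpp
// Minimal sketch (hypothetical names): useful work per CPU-second.
#include <sys/resource.h>
#include <sys/time.h>

#include <atomic>
#include <cstdint>

// Incremented by the serving path once per unit of useful work,
// e.g., per cat video processed.
std::atomic<int64_t> videos_processed{0};

// Total user + system CPU time consumed by this process, in seconds.
double CpuSecondsUsed() {
  struct rusage usage;
  getrusage(RUSAGE_SELF, &usage);
  auto seconds = [](const timeval& tv) {
    return tv.tv_sec + tv.tv_usec / 1e6;
  };
  return seconds(usage.ru_utime) + seconds(usage.ru_stime);
}

// Productivity: videos per CPU-second. Comparing this metric across two
// versions of a binary distinguishes "busier" from "less efficient."
double Productivity() {
  return videos_processed.load() / CpuSecondsUsed();
}
```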

If we do not have productivity metrics, we are faced with *entire classes of
optimizations* that are not well represented by existing metrics:

* **Application speedups through core library changes**:

  As seen in our [2021 OSDI paper](https://research.google/pubs/pub50370/),
  "one classical approach is to increase the efficiency of an allocator to
  minimize the cycles spent in the allocator code. However, memory allocation
  decisions also impact overall application performance via data placement,
  offering opportunities to improve fleetwide productivity by completing more
  units of application work using fewer hardware resources."

  Experiments with TCMalloc's hugepage-aware allocator, also known as
  Temeraire, have shown considerable speedups by improving application
  performance, not by reducing time spent in TCMalloc.

  We spend more *relative* time in TCMalloc but greatly improve application
  performance. Focusing just on relative time in TCMalloc would produce an
  error in sign: We'd deprioritize (or even roll back) a strongly positive
  optimization.

* **Arena allocation for protocol buffers**: Allocating more protocol buffer
  messages on [Arenas](https://protobuf.dev/reference/cpp/arenas/) speeds up
  not just the protocol buffer code itself (like message destructors), but
  also the business logic that uses the messages. Enabling arenas in major
  frameworks allowed them to process 15-30% more work per CPU, even though
  protobuf destructor costs were only a small fraction of that total: the
  improvements in data locality could produce outsized benefits for the
  entire application. (A minimal arena sketch appears after this list.)

* **New instruction sets**: With successive hardware generations, vendors have
  added new instructions to their ISAs.

  In future hardware generations, we expect to replace calls to `memcpy` with
  microcode-optimized `rep movsb` instructions that are faster than any
  handwritten assembly sequence we can come up with (see the copy-routine
  sketch after this list). We expect `rep movsb` to have low IPC: It's a
  single instruction that replaces an entire loop's worth of instructions!

  Use of these new instructions can be triggered by changes to the source code
  or by compiler enhancements that improve vectorization.

  Focusing on MIPS or IPC would cause us to prefer any implementation that
  executes a large number of instructions, even if it takes longer overall to
  copy `n` bytes.

  In fact, enabling the AVX, FMA, and BMI instruction sets by compiling with
  `-march=haswell` shows a MIPS regression while simultaneously improving
  *application productivity*. These instructions can do more work per
  instruction; however, replacing several low-latency instructions may mean
  that *average* instruction latency increases. If we had 10 million
  instructions taking 10 ms per query, we may now have 8 million instructions
  taking only 9 ms per query: each query completes 10% faster, but MIPS falls
  from 1,000 (10M / 10 ms) to roughly 889 (8M / 9 ms). QPS is up and MIPS is
  down.

  Since Google's fleet runs on a wide variety of architectures, we cannot
  easily compare instructions across platforms and need to instead compare
  useful work accomplished by an application.

* **Compiler optimizations**: Compiler optimizations can significantly affect
  the number of dynamically executed instructions. Techniques such as inlining
  eliminate function prologues and enable further simplifying optimizations.
  Here, *fewer* instructions translate into *faster*, *more productive* code.

* **Kernel optimizations**: The kernel has many policies around hugepages,
  thread scheduling, and other system parameters. While changing these
  policies may make the kernel itself nominally more costly (for example, if
  it does more work to compact memory), the application benefits can easily
  outweigh those costs.
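
As a companion to the arena bullet above, here is a minimal sketch of
arena-allocating protocol buffer messages. `MyMessage`, `Fill`, and `Handle`
are hypothetical stand-ins for a generated message type and the business logic
that uses it:

```cpp
#include <google/protobuf/arena.h>

#include "my_message.pb.h"  // hypothetical generated proto header

void Fill(MyMessage* msg);          // hypothetical business logic
void Handle(const MyMessage& msg);  // hypothetical business logic

void ProcessBatch() {
  google::protobuf::Arena arena;
  // The message and its submessages are allocated contiguously on the
  // arena, which can improve data locality for the code that reads them.
  MyMessage* msg = google::protobuf::Arena::Create<MyMessage>(&arena);
  Fill(msg);
  Handle(*msg);
  // No per-message destructor calls: the arena releases its memory all at
  // once when it goes out of scope here.
}
```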
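
And for the instruction-set bullet, here is the promised copy-routine sketch:
an illustration of the `rep movsb` idea using GCC/Clang inline assembly on
x86-64, not the actual `memcpy` implementation:

```cpp
#include <cstddef>

// "rep movsb" copies RCX bytes from [RSI] to [RDI]; the constraints pin
// dst, src, and n to those registers. A single microcode-optimized
// instruction replaces an entire copy loop, so instruction counts (and
// thus MIPS and IPC) drop even as the copy itself gets faster.
void* RepMovsbCopy(void* dst, const void* src, std::size_t n) {
  void* ret = dst;
  asm volatile("rep movsb"
               : "+D"(dst), "+S"(src), "+c"(n)
               :  // no read-only inputs
               : "memory");
  return ret;
}
```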

Availability of these metrics helps infrastructure and efficiency teams guide
their work more effectively.