---
title: "Performance Tip of the Week #7: Optimizing for application productivity"
layout: fast
sidenav: side-nav-fast.html
published: true
permalink: fast/7
type: markdown
order: "007"
---

Originally posted as Fast TotW #7 on June 6, 2019

*By [Chris Kennelly](mailto:[email protected])*

Updated 2023-09-14

Quicklink: [abseil.io/fast/7](https://abseil.io/fast/7)

## Overview

Google manages a vast fleet of servers to handle search queries, process
records, and transcode cat videos. We don't buy those servers to allocate
memory in TCMalloc, to put protocol buffers into other protocol buffers, or to
handle branch mispredictions by our processors.

To make our fleet more efficient, we want to optimize for how productive our
servers are, that is, how much useful work they accomplish per CPU-second, byte
of RAM, disk IOPS, or unit of hardware-accelerator time. While measuring a
job's resource consumption is easy, it's harder to tell just how much useful
work it's accomplishing without help.

A task's CPU usage going up could mean it suffered a performance regression, or
it could mean the task is simply busier. Consider a plot of a service's CPU
usage against time, with the total broken down by the two versions of the
binary. We cannot determine from casual inspection what caused the increase in
CPU usage: an increase in workload (serving more videos per unit time) or a
decrease in efficiency (some added, needless protocol conversion per video).

To determine what is really happening, we need a productivity metric that
captures the amount of real work completed. If we know the number of cat videos
processed, we can easily determine whether we are getting more (or less) real
work done per CPU-second (or byte of RAM, disk operation, or unit of hardware
accelerator time).
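
For concreteness, here is a minimal sketch of what exporting such a metric
could look like; the counter, the work unit, and the function names are
hypothetical, not from the original post. The idea is simply to pair a count of
completed work items with the CPU time consumed:

```cpp
// Minimal sketch (hypothetical names): useful work per CPU-second.
#include <sys/resource.h>
#include <sys/time.h>

#include <atomic>
#include <cstdint>

// Incremented by the serving path once per unit of useful work,
// e.g., per cat video processed.
std::atomic<int64_t> videos_processed{0};

// Total user + system CPU time consumed by this process, in seconds.
double CpuSecondsUsed() {
  struct rusage usage;
  getrusage(RUSAGE_SELF, &usage);
  auto seconds = [](const timeval& tv) {
    return tv.tv_sec + tv.tv_usec / 1e6;
  };
  return seconds(usage.ru_utime) + seconds(usage.ru_stime);
}

// Productivity: videos per CPU-second. Comparing this metric across two
// versions of a binary distinguishes "busier" from "less efficient."
double Productivity() {
  return videos_processed.load() / CpuSecondsUsed();
}
```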

If we do not have productivity metrics, we are faced with *entire classes of
optimizations* that are not well represented by existing metrics:

* **Application speedups through core library changes**:

  As seen in our [2021 OSDI paper](https://research.google/pubs/pub50370/),
  "one classical approach is to increase the efficiency of an allocator to
  minimize the cycles spent in the allocator code. However, memory allocation
  decisions also impact overall application performance via data placement,
  offering opportunities to improve fleetwide productivity by completing more
  units of application work using fewer hardware resources."

  Experiments with TCMalloc's hugepage-aware allocator, also known as
  Temeraire, have shown considerable speedups by improving application
  performance, not by reducing time spent in TCMalloc.

  We spend more *relative* time in TCMalloc but greatly improve application
  performance. Focusing just on relative time in TCMalloc would produce an
  error in sign: We'd deprioritize (or even roll back) a strongly positive
  optimization.

* **Arena allocation for protocol buffers**: Allocating more protocol buffer
  messages on [Arenas](https://protobuf.dev/reference/cpp/arenas/) speeds up
  not just the protocol buffer code itself (like message destructors), but
  also the business logic that uses the messages. Enabling arenas in major
  frameworks allowed them to process 15-30% more work per CPU, even though
  protobuf destructor costs were only a small fraction of that total: the
  improvements in data locality could produce outsized benefits for the
  entire application. (A minimal arena sketch appears after this list.)

* **New instruction sets**: With successive hardware generations, vendors have
  added new instructions to their ISAs.

  In future hardware generations, we expect to replace calls to `memcpy` with
  microcode-optimized `rep movsb` instructions that are faster than any
  handwritten assembly sequence we can come up with (see the copy-routine
  sketch after this list). We expect `rep movsb` to have low IPC: It's a
  single instruction that replaces an entire loop's worth of instructions!

  Use of these new instructions can be triggered by changes to the source code
  or by compiler enhancements that improve vectorization.

  Focusing on MIPS or IPC would cause us to prefer any implementation that
  executes a large number of instructions, even if it takes longer overall to
  copy `n` bytes.

  In fact, enabling the AVX, FMA, and BMI instruction sets by compiling with
  `-march=haswell` shows a MIPS regression while simultaneously improving
  *application productivity*. These instructions can do more work per
  instruction; however, replacing several low-latency instructions may mean
  that *average* instruction latency increases. If we had 10 million
  instructions taking 10 ms per query, we may now have 8 million instructions
  taking only 9 ms per query: each query completes 10% faster, but MIPS falls
  from 1,000 (10M / 10 ms) to roughly 889 (8M / 9 ms). QPS is up and MIPS is
  down.

  Since Google's fleet runs on a wide variety of architectures, we cannot
  easily compare instructions across platforms and need to instead compare
  useful work accomplished by an application.

* **Compiler optimizations**: Compiler optimizations can significantly affect
  the number of dynamically executed instructions. Techniques such as inlining
  eliminate function prologues and enable further simplifying optimizations.
  Here, *fewer* instructions translate into *faster*, *more productive* code.

* **Kernel optimizations**: The kernel has many policies around hugepages,
  thread scheduling, and other system parameters. While changing these
  policies may make the kernel itself nominally more costly (for example, if
  it does more work to compact memory), the application benefits can easily
  outweigh those costs.
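
As a companion to the arena bullet above, here is a minimal sketch of
arena-allocating protocol buffer messages. `MyMessage`, `Fill`, and `Handle`
are hypothetical stand-ins for a generated message type and the business logic
that uses it:

```cpp
#include <google/protobuf/arena.h>

#include "my_message.pb.h"  // hypothetical generated proto header

void Fill(MyMessage* msg);          // hypothetical business logic
void Handle(const MyMessage& msg);  // hypothetical business logic

void ProcessBatch() {
  google::protobuf::Arena arena;
  // The message and its submessages are allocated contiguously on the
  // arena, which can improve data locality for the code that reads them.
  MyMessage* msg = google::protobuf::Arena::Create<MyMessage>(&arena);
  Fill(msg);
  Handle(*msg);
  // No per-message destructor calls: the arena releases its memory all at
  // once when it goes out of scope here.
}
```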
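
And for the instruction-set bullet, here is the promised copy-routine sketch:
an illustration of the `rep movsb` idea using GCC/Clang inline assembly on
x86-64, not the actual `memcpy` implementation:

```cpp
#include <cstddef>

// "rep movsb" copies RCX bytes from [RSI] to [RDI]; the constraints pin
// dst, src, and n to those registers. A single microcode-optimized
// instruction replaces an entire copy loop, so instruction counts (and
// thus MIPS and IPC) drop even as the copy itself gets faster.
void* RepMovsbCopy(void* dst, const void* src, std::size_t n) {
  void* ret = dst;
  asm volatile("rep movsb"
               : "+D"(dst), "+S"(src), "+c"(n)
               :  // no read-only inputs
               : "memory");
  return ret;
}
```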

Availability of these metrics helps infrastructure and efficiency teams guide
their work more effectively.