Commit f343ff8

Publish abseil.io/fast/7 on productivity metrics.
1 parent af238cc commit f343ff8

File tree: 2 files changed, +117 −2 lines

_posts/2023-03-02-fast-39.md

Lines changed: 2 additions & 2 deletions

@@ -143,8 +143,8 @@ This prefetch appears to be extraordinarily costly: Microbenchmarks measuring
 allocation performance show potential savings if it were removed and Google-Wide
 Profiling shows 70%+ of cycles in `new`'s fastpath on the prefetch. Removing it
 would "reduce" the [data center tax](https://research.google/pubs/pub44271.pdf),
-but we would actually hurt application productivity-per-CPU. Time we spend in
-malloc is
+but we would actually hurt [application productivity](/fast/7)-per-CPU. Time we
+spend in malloc is
 [less important than application performance](https://research.google/pubs/pub50370.pdf).

 Trace-driven simulations with hardware-validated architectural simulators showed

_posts/2023-09-14-fast-7.md

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@

---
title: "Performance Tip of the Week #7: Optimizing for application productivity"
layout: fast
sidenav: side-nav-fast.html
published: true
permalink: fast/7
type: markdown
order: "007"
---

Originally posted as Fast TotW #7 on June 6, 2019

*By [Chris Kennelly](mailto:[email protected])*

Updated 2023-09-14

Quicklink: [abseil.io/fast/7](https://abseil.io/fast/7)

## Overview

Google manages a vast fleet of servers to handle search queries, process
records, and transcode cat videos. We don't buy those servers to allocate
memory in TCMalloc, to put protocol buffers into other protocol buffers, or to
handle branch mispredictions by our processors.

To make our fleet more efficient, we want to optimize for how productive our
servers are: how much useful work they accomplish per CPU-second, per byte of
RAM, per disk IOP, or per unit of hardware-accelerator time. While measuring a
job's resource consumption is easy, it's harder to tell just how much useful
work it's accomplishing without help.

A task's CPU usage going up could mean it suffered a performance regression, or
it could mean the task is simply busier. Consider a plot of a service's CPU
usage against time, broken down by two versions of the binary. Casual
inspection cannot tell us what caused the increase in CPU usage: an increase in
workload (serving more videos per unit time) or a decrease in efficiency (some
added, needless protocol conversion per video).

To determine what is really happening, we need a productivity metric that
captures the amount of real work completed. If we know the number of cat videos
processed, we can easily determine whether we are getting more or less real
work done per CPU-second (or byte of RAM, disk operation, or unit of
hardware-accelerator time).
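
To make this concrete, here is a minimal sketch of such a metric. The
`UsageSnapshot` struct, its counters, and `ProductivityPerCpuSecond` are
hypothetical stand-ins for whatever work and resource counters a real service
exports to its monitoring system:

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical cumulative counters exported by a service: useful work
// completed (e.g., cat videos processed) and CPU time consumed.
struct UsageSnapshot {
  uint64_t work_units;
  double cpu_seconds;
};

// Productivity over the interval between two snapshots: useful work
// completed per CPU-second.
double ProductivityPerCpuSecond(const UsageSnapshot& before,
                                const UsageSnapshot& after) {
  return static_cast<double>(after.work_units - before.work_units) /
         (after.cpu_seconds - before.cpu_seconds);
}

int main() {
  UsageSnapshot before{1'000'000, 10'000.0};
  UsageSnapshot after{2'000'000, 20'000.0};
  // Prints 100 units/CPU-s. If a new binary shows the same ratio at twice
  // the CPU usage, it is busier, not slower; if the ratio falls, that is a
  // genuine efficiency regression.
  std::cout << ProductivityPerCpuSecond(before, after) << " units/CPU-s\n";
}
```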

If we do not have productivity metrics, we are faced with *entire classes of
optimizations* that are not well-represented by existing metrics:

*   **Application speedups through core library changes**:

    As seen in our [2021 OSDI paper](https://research.google/pubs/pub50370/),
    "one classical approach is to increase the efficiency of an allocator to
    minimize the cycles spent in the allocator code. However, memory allocation
    decisions also impact overall application performance via data placement,
    offering opportunities to improve fleetwide productivity by completing more
    units of application work using fewer hardware resources."

    Experiments with TCMalloc's hugepage-aware allocator, also known as
    Temeraire, have shown considerable speedups by improving application
    performance, not time spent in TCMalloc.

    We spend more *relative* time in TCMalloc but greatly improve application
    performance. Focusing just on relative time in TCMalloc would produce an
    error in sign: we'd deprioritize (or even roll back) a strongly positive
    optimization.

*   Allocating more protocol buffer messages on
    [Arenas](https://protobuf.dev/reference/cpp/arenas/) speeds up not just the
    protocol buffer code itself (like message destructors), but also the
    business logic that uses them. Enabling Arenas in major frameworks allowed
    them to process 15-30% more work per CPU, yet protobuf destructor costs
    were only a small fraction of that improvement. The improvements in data
    locality could produce outsized benefits for the entire application (see
    the Arena sketch after this list).

*   **New instruction sets**: With successive hardware generations, vendors
    have added new instructions to their ISAs.

    In future hardware generations, we expect to replace calls to memcpy with
    microcode-optimized `rep movsb` instructions that are faster than any
    handwritten assembly sequence we can come up with (see the `rep movsb`
    sketch after this list). We expect `rep movsb` to have low IPC: it's a
    single instruction that replaces an entire copy loop of instructions!

    Use of these new instructions can be triggered by optimizing the source
    code or through compiler enhancements that improve vectorization.

    Focusing on MIPS or IPC would cause us to prefer any implementation that
    executes a large number of instructions, even if those instructions take
    longer to copy `n` bytes.

    In fact, enabling the AVX, FMA, and BMI instruction sets by compiling with
    `--march=haswell` shows a MIPS regression while simultaneously improving
    *application productivity*. These instructions can do more work per
    instruction; however, replacing several low-latency instructions with one
    may mean that *average* instruction latency increases. If we had 10 million
    instructions taking 10 ms per query, we may now have 8 million instructions
    taking only 9 ms per query: QPS rises by about 11% while MIPS falls from
    1,000 to roughly 890. QPS is up and MIPS would go down.

    Since Google's fleet runs on a wide variety of architectures, we cannot
    easily compare instructions across platforms and need to instead compare
    useful work accomplished by an application.

*   **Compiler optimizations**: Compiler optimizations can significantly affect
    the number of dynamically executed instructions. Techniques such as
    inlining reduce function preambles and enable further simplifying
    optimizations. Thus, *fewer* instructions translate to *faster*, *more
    productive* code.

*   **Kernel optimizations**: The kernel has many policies around hugepages,
    thread scheduling, and other system parameters. While changing these
    policies may make the kernel itself nominally more costly (for example, if
    we did more work to compact memory), the application benefits can easily
    outweigh the added kernel cost.
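
Two sketches make the library- and instruction-level points above concrete.
First, a minimal example of allocating messages on a protocol buffer Arena;
`google::protobuf::Arena` is the real C++ API, while the `CatVideo` message
and its `id` field are hypothetical stand-ins:

```cpp
#include <google/protobuf/arena.h>

#include "cat_video.pb.h"  // Hypothetical generated message type `CatVideo`.

void ProcessBatch() {
  google::protobuf::Arena arena;
  for (int i = 0; i < 1000; ++i) {
    // Arena-allocated: contiguously placed, no per-message heap free.
    CatVideo* video = google::protobuf::Arena::Create<CatVideo>(&arena);
    video->set_id(i);
    // ... business logic touching `video` benefits from data locality ...
  }
  // All messages are reclaimed in bulk when `arena` goes out of scope,
  // replacing a thousand individual destructor calls.
}
```

Second, a sketch of the `rep movsb` idea, assuming x86-64 and GCC/Clang
inline-assembly syntax. A single microcoded instruction stands in for an
entire hand-written copy loop, so the dynamic instruction count (and with it
IPC) drops even as the copy gets faster:

```cpp
#include <cstddef>

// Copies `n` bytes from `src` to `dst`. `rep movsb` consumes RDI
// (destination), RSI (source), and RCX (count), advancing all three.
void RepMovsbCopy(void* dst, const void* src, size_t n) {
  asm volatile("rep movsb"
               : "+D"(dst), "+S"(src), "+c"(n)
               :
               : "memory");
}
```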

Availability of these metrics helps infrastructure and efficiency teams guide
their work more effectively.
