
Commit d01ab40

Merge pull request #489 from aalexand/fast-83
Publish abseil.io/fast/{62,72,79,83}; other minor fixes.
2 parents 73fa238 + 1f2d4da commit d01ab40

10 files changed (+1233, -47 lines)

_posts/2023-03-02-fast-39.md

Lines changed: 3 additions & 3 deletions
@@ -159,9 +159,9 @@ There are a number of things that commonly go wrong when writing benchmarks. The
 following is a non-exhaustive list:

 * Data being resident. Workloads have large footprints, a small footprint may
-  be instruction bound, whereas the true workload could be memory bound.
-  There's a trade-off between adding instructions to save some memory costs vs
-  placing data in memory to save instructions.
+  be instruction bound, whereas the true workload could be
+  [memory bound](/fast/62). There's a trade-off between adding instructions to
+  save some memory costs vs placing data in memory to save instructions.
 * Small instruction cache footprint. Google codes typically have large
   instruction footprints. Benchmarks are often cache resident. The `memcmp`
   and TCMalloc examples go directly to this.
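
To illustrate the data-residency pitfall called out in the hunk above, here is a minimal sketch, not taken from the post, that sweeps a benchmark's working-set size so the same loop can be observed both cache resident and memory bound:

<pre class="prettyprint code">
#include <cstdint>
#include <numeric>
#include <vector>

#include "benchmark/benchmark.h"

// Sketch only: the same summation loop behaves very differently depending on
// whether its working set fits in cache or spills to memory.
void BM_SumWorkingSet(benchmark::State& state) {
  const size_t elements = state.range(0);
  std::vector<uint64_t> data(elements);
  std::iota(data.begin(), data.end(), 0);
  for (auto _ : state) {
    uint64_t sum = 0;
    for (uint64_t v : data) sum += v;
    benchmark::DoNotOptimize(sum);
  }
  state.SetBytesProcessed(static_cast<int64_t>(state.iterations()) *
                          elements * sizeof(uint64_t));
}
// From ~32 KiB (comfortably cache resident) to ~1 GiB (well past the LLC).
BENCHMARK(BM_SumWorkingSet)->Range(4 << 10, 128 << 20);
</pre>

The small end of that range tends to look instruction bound, while the large end is dominated by memory stalls, closer to the true workload the post describes.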

_posts/2023-09-14-fast-7.md

Lines changed: 26 additions & 25 deletions
@@ -25,29 +25,30 @@ in TCMalloc, put protocol buffers into other protocol buffers, or to handle
 branch mispredictions by our processors.

 To make our fleet more efficient, we want to optimize for how productive our
-servers are, that is, how much useful work they accomplish per CPU-second, byte
-of RAM, disk IOPS, or by using hardware accelerators. While measuring a job's
-resource consumption is easy, it's harder to tell just how much useful work it's
-accomplishing without help.
-
-A task's CPU usage going up could mean it suffered a performance regression or
-that it's simply busier. Consider a plot of a service's CPU usage against time,
-breaking down the total CPU usage of two versions of the binary. We cannot
-determine from casual inspection what caused the increase in CPU usage, whether
-this is from an increase in workload (serving more videos per unit time) or a
-decrease in efficiency (some added, needless protocol conversion per video).
-
-To determine what is really happening we need a productivity metric which
+servers are, that is, how much useful work they accomplish per CPU-second,
+byte-second of RAM, disk operation, or by using hardware accelerators. While
+measuring a job's resource consumption is easy, it's harder to tell just how
+much useful work it's accomplishing without help.
+
+A task's CPU usage going up could mean the task has suffered a performance
+regression or that it's simply busier. Consider a plot of a service's CPU usage
+against time, breaking down the total CPU usage of two versions of the binary.
+We cannot determine from casual inspection what caused the increase in CPU
+usage, whether this is from an increase in workload (serving more videos per
+unit time) or a decrease in efficiency (some added, needless protocol conversion
+per video).
+
+To determine what is really happening, we need a productivity metric which
 captures the amount of real work completed. If we know the number of cat videos
-processed we can easily determine whether we are getting more, or less, real
-work done per CPU-second (or byte of RAM, disk operation, or hardware
+processed, we can easily determine whether we are getting more, or less, real
+work done per CPU-second (or byte-second of RAM, disk operation, or hardware
 accelerator time). These metrics are referred to as *application productivity
 metrics*, or *APMs*.

 If we do not have productivity metrics, we are faced with *entire classes of
 optimizations* that are not well-represented by existing metrics:

-* **Application speedups through core library changes**:
+* **Application speedups through core infrastructure changes**:

   As seen in our [2021 OSDI paper](https://research.google/pubs/pub50370/),
   "one classical approach is to increase the efficiency of an allocator to
@@ -79,21 +80,21 @@ optimizations* that are not well-represented by existing metrics:
   In future hardware generations, we expect to replace calls to memcpy with
   microcode-optimized `rep movsb` instructions that are faster than any
   handwritten assembly sequence we can come up with. We expect `rep movsb` to
-  have low IPC: It's a single instruction that replaces an entire copy loop of
-  instructions!
+  have low IPC (instructions per cycle): It's a single instruction that
+  replaces an entire copy loop of instructions!

   Using these new instructions can be triggered by optimizing the source code
   or through compiler enhancements that improve vectorization.

-  Focusing on MIPS or IPC would cause us to prefer any implementation that
-  executes a large number of instructions, even if those instructions take
-  longer to execute to copy `n` bytes.
+  Focusing on MIPS (millions of instructions per second) or IPC would cause us
+  to prefer any implementation that executes a large number of instructions,
+  even if those instructions take longer to execute to copy `n` bytes.

   In fact, enabling the AVX, FMA, and BMI instruction sets by compiling with
-  `--march=haswell` shows a MIPS regression while simultaneously improving
-  *application productivity improvement*. These instructions can do more work
-  per instruction, however, replacing several low latency instructions may
-  mean that *average* instruction latency increases. If we had 10 million
+  `--march=haswell` shows a MIPS regression while simultaneously *improving
+  application productivity*. These instructions can do more work per
+  instruction, however, replacing several low latency instructions may mean
+  that *average* instruction latency increases. If we had 10 million
   instructions and 10 ms per query, we may now have 8 million instructions
   taking only 9 ms per query. QPS is up and MIPS would go down.
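
The arithmetic at the end of this hunk can be checked directly; a tiny sketch, using only the illustrative numbers from the post, showing MIPS falling while QPS rises:

<pre class="prettyprint code">
#include <cstdio>

// Fewer instructions per query and lower query latency, yet MIPS goes down
// because average instruction latency went up.
int main() {
  // Before: 10 million instructions, 10 ms per query.
  const double mips_before = 10e6 / 10e-3 / 1e6;  // 1000 MIPS
  const double qps_before = 1.0 / 10e-3;          // 100 QPS
  // After: 8 million instructions, 9 ms per query.
  const double mips_after = 8e6 / 9e-3 / 1e6;     // ~889 MIPS
  const double qps_after = 1.0 / 9e-3;            // ~111 QPS
  std::printf("MIPS %.0f -> %.0f, QPS %.0f -> %.0f\n",
              mips_before, mips_after, qps_before, qps_after);
  return 0;
}
</pre>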

_posts/2023-09-30-fast-52.md

Lines changed: 1 addition & 1 deletion
@@ -151,7 +151,7 @@ code base, keeping that choice optimal over time.

 For some uses, this strategy is infeasible. `my::super_fast_string` will
 probably never replace `std::string` because the latter is so entrenched and the
-impedence mismatch of living in an independent string ecosystem exceeds the
+impedance mismatch of living in an independent string ecosystem exceeds the
 benefits. Multiple
 [vocabulary types](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2125r0.pdf)
 suffer from impedance mismatch--costly interconversions can overwhelm the

_posts/2023-10-20-fast-70.md

Lines changed: 6 additions & 6 deletions
@@ -105,12 +105,12 @@ developing optimizations. For example,
   [limitations](/fast/39), but as long as we're mindful of those pitfalls,
   they can get us directional information much more quickly.
 * PMU counters can tell us rich details about [bottlenecks in code](/fast/53)
-  such as cache misses or branch mispredictions. Seeing changes in these
-  metrics can be a *proxy* that helps us understand the effect. For example,
-  inserting software prefetches can reduce cache miss events, but in a memory
-  bandwidth-bound program, the prefetches can go no faster than the "speed of
-  light" of the memory bus. Similarly, eliminating a stall far off the
-  critical path might have little bearing on the application's actual
+  such as [cache misses](/fast/62) or branch mispredictions. Seeing changes in
+  these metrics can be a *proxy* that helps us understand the effect. For
+  example, inserting software prefetches can reduce cache miss events, but in
+  a memory bandwidth-bound program, the prefetches can go no faster than the
+  "speed of light" of the memory bus. Similarly, eliminating a stall far off
+  the critical path might have little bearing on the application's actual
   performance.

 If we expect to improve an application's performance, we might start by taking a
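
The software-prefetch observation in this hunk can be made concrete with a small sketch using the GCC/Clang `__builtin_prefetch` intrinsic; the data structure and prefetch distance below are illustrative assumptions, not code from the post:

<pre class="prettyprint code">
#include <cstddef>
#include <vector>

// Sketch only: prefetch a few pointer targets ahead of where we are reading.
// This can shrink cache-miss counts in a latency-bound loop, but in a memory
// bandwidth-bound program it cannot beat the "speed of light" of the bus.
long SumIndirect(const std::vector<const long*>& ptrs) {
  constexpr size_t kPrefetchDistance = 8;  // Tuning assumption.
  long total = 0;
  for (size_t i = 0; i < ptrs.size(); ++i) {
    if (i + kPrefetchDistance < ptrs.size()) {
      __builtin_prefetch(ptrs[i + kPrefetchDistance]);
    }
    total += *ptrs[i];
  }
  return total;
}
</pre>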

_posts/2023-11-10-fast-74.md

Lines changed: 5 additions & 4 deletions
@@ -63,11 +63,12 @@ trying to reduce easily understood costs would have led to a worse outcome.
 ### Artificial costs in TCMalloc

 Starting from 2016, work commenced to reduce TCMalloc's cost. Much of this early
-work involved making things generally faster, by removing instructions, avoiding
-cache misses, and shortening lock critical sections.
+work involved making things generally faster, by removing instructions,
+[avoiding cache misses](/fast/62), and shortening lock critical sections.

-During this process, a prefetch was added on its fast path. GWP even indicates
-that 70%+ of cycles in the `malloc` fastpath are
+During this process, a prefetch was added on its fast path. Our
+[fleet-wide profiling](https://research.google/pubs/google-wide-profiling-a-continuous-profiling-infrastructure-for-data-centers/)
+even indicates that 70%+ of cycles in the `malloc` fastpath are
 [spent on that prefetch](/fast/39)! Guided by the costs we could easily
 understand, we might be tempted to remove it. TCMalloc's fast path would appear
 cheaper, but other code somewhere else would experience a cache miss and

_posts/2023-11-10-fast-75.md

Lines changed: 8 additions & 8 deletions
@@ -227,11 +227,11 @@ when the processor's execution
 ## Understanding the speed of light

 Before embarking too far on optimizing the `ParseVarint32` routine, we might
-want to identify the "speed of light" of the hardware. For varint parsing, this
-is *approximately* `memcpy`, since we are reading serialized bytes and writing
-the (mostly expanded) bytes into the parsed data structure. While this is not
-quite the operation we're interested in, it's readily available off the shelf
-without much effort.
+want to identify the ["speed of light"](/fast/72) of the hardware. For varint
+parsing, this is *approximately* `memcpy`, since we are reading serialized bytes
+and writing the (mostly expanded) bytes into the parsed data structure. While
+this is not quite the operation we're interested in, it's readily available off
+the shelf without much effort.

 <pre class="prettyprint code">
 void BM_Memcpy(benchmark::State& state) {
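
The diff context stops at the first line of the post's `BM_Memcpy`, so its body is not shown here. A plausible sketch of such a memcpy "speed of light" baseline with Google Benchmark might look like the following; the buffer sizes and the `SetBytesProcessed` accounting are assumptions, not the original code:

<pre class="prettyprint code">
#include <cstring>
#include <vector>

#include "benchmark/benchmark.h"

// Sketch only: copy a buffer of the requested size and report bandwidth,
// giving a rough upper bound to compare varint parsing against.
void BM_MemcpyBaseline(benchmark::State& state) {
  const size_t size = state.range(0);
  std::vector<char> src(size, 'x');
  std::vector<char> dst(size);
  for (auto _ : state) {
    std::memcpy(dst.data(), src.data(), size);
    benchmark::DoNotOptimize(dst.data());
  }
  state.SetBytesProcessed(static_cast<int64_t>(state.iterations()) * size);
}
BENCHMARK(BM_MemcpyBaseline)->Range(8, 8 << 10);
</pre>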
@@ -432,9 +432,9 @@ This pattern is also used in
 example, it has a "hot" SwissMap benchmark that performs its operations
 (lookups, etc.) against a single instance and a "cold" SwissMap benchmark where
 we randomly pick a table on each iteration. The latter makes it more likely that
-we'll incur a cache miss. Hardware counters and the benchmark framework's
-[support for collecting them](/fast/53) can help diagnose and explain
-performance differences.
+we'll [incur a cache miss](/fast/62). Hardware counters and the benchmark
+framework's [support for collecting them](/fast/53) can help diagnose and
+explain performance differences.

 Even though the extremes are not representative, they can help us frame how to
 tackle the problem. We might find an optimization for one extreme and then work
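
A sketch of the hot/cold pattern this hunk describes, using `absl::flat_hash_map` with Google Benchmark; the benchmark names, table count, and sizes are illustrative assumptions rather than the actual SwissMap benchmark code:

<pre class="prettyprint code">
#include <cstdint>
#include <vector>

#include "absl/container/flat_hash_map.h"
#include "absl/random/random.h"
#include "benchmark/benchmark.h"

// "Hot": every lookup goes to one small table that stays cache resident.
void BM_LookupHot(benchmark::State& state) {
  absl::flat_hash_map<uint64_t, uint64_t> table;
  for (uint64_t i = 0; i < 1024; ++i) table[i] = i;
  uint64_t key = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(table.find(key));
    key = (key + 1) % 1024;
  }
}
BENCHMARK(BM_LookupHot);

// "Cold": pick a different table each iteration, making cache misses far more
// likely and the benchmark closer to how a large application touches its maps.
void BM_LookupCold(benchmark::State& state) {
  constexpr int kTables = 1024;
  std::vector<absl::flat_hash_map<uint64_t, uint64_t>> tables(kTables);
  for (auto& table : tables) {
    for (uint64_t i = 0; i < 1024; ++i) table[i] = i;
  }
  absl::BitGen gen;
  for (auto _ : state) {
    auto& table = tables[absl::Uniform<int>(gen, 0, kTables)];
    benchmark::DoNotOptimize(table.find(absl::Uniform<uint64_t>(gen, 0, 1024)));
  }
}
BENCHMARK(BM_LookupCold);
</pre>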
