@@ -25,29 +25,30 @@ in TCMalloc, put protocol buffers into other protocol buffers, or to handle
branch mispredictions by our processors.

To make our fleet more efficient, we want to optimize for how productive our
- servers are, that is, how much useful work they accomplish per CPU-second, byte
- of RAM, disk IOPS, or by using hardware accelerators. While measuring a job's
- resource consumption is easy, it's harder to tell just how much useful work it's
- accomplishing without help.
-
- A task's CPU usage going up could mean it suffered a performance regression or
- that it's simply busier. Consider a plot of a service's CPU usage against time,
- breaking down the total CPU usage of two versions of the binary. We cannot
- determine from casual inspection what caused the increase in CPU usage, whether
- this is from an increase in workload (serving more videos per unit time) or a
- decrease in efficiency (some added, needless protocol conversion per video).
-
- To determine what is really happening we need a productivity metric which
+ servers are, that is, how much useful work they accomplish per CPU-second,
+ byte-second of RAM, disk operation, or by using hardware accelerators. While
+ measuring a job's resource consumption is easy, it's harder to tell just how
+ much useful work it's accomplishing without help.
+
+ A task's CPU usage going up could mean the task has suffered a performance
+ regression or that it's simply busier. Consider a plot of a service's CPU usage
+ against time, breaking down the total CPU usage of two versions of the binary.
+ We cannot determine from casual inspection what caused the increase in CPU
+ usage, whether this is from an increase in workload (serving more videos per
+ unit time) or a decrease in efficiency (some added, needless protocol conversion
+ per video).
+
+ To determine what is really happening, we need a productivity metric which
captures the amount of real work completed. If we know the number of cat videos
- processed we can easily determine whether we are getting more, or less, real
- work done per CPU-second (or byte of RAM, disk operation, or hardware
+ processed, we can easily determine whether we are getting more, or less, real
+ work done per CPU-second (or byte-second of RAM, disk operation, or hardware
accelerator time). These metrics are referred to as *application productivity
metrics*, or *APMs*.

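For illustration, an APM boils down to a work counter divided by the resource it consumes. The sketch below is a minimal, hypothetical C++ example rather than an existing API: the counter `videos_processed`, the helper names, and the use of POSIX `getrusage` for CPU accounting are all assumptions made for this sketch.

```cpp
// Minimal sketch of an application productivity metric: useful work
// (cat videos processed) per CPU-second consumed by the process.
#include <sys/resource.h>
#include <sys/time.h>

#include <atomic>
#include <cstdint>

// Incremented by the serving path each time a video is processed
// (hypothetical counter, shown here for illustration only).
std::atomic<std::uint64_t> videos_processed{0};

// CPU-seconds (user + system) consumed by this process so far.
double ProcessCpuSeconds() {
  rusage usage{};
  getrusage(RUSAGE_SELF, &usage);
  auto seconds = [](const timeval& tv) { return tv.tv_sec + tv.tv_usec / 1e6; };
  return seconds(usage.ru_utime) + seconds(usage.ru_stime);
}

// The productivity metric: videos processed per CPU-second. If this ratio
// drops across a release, the binary is doing less useful work per unit of
// compute, even when total CPU usage alone is ambiguous.
double VideosPerCpuSecond() {
  return static_cast<double>(videos_processed.load()) / ProcessCpuSeconds();
}
```

The same ratio generalizes to the other denominators mentioned above: byte-seconds of RAM, disk operations, or hardware-accelerator time.
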
If we do not have productivity metrics, we are faced with *entire classes of
optimizations* that are not well-represented by existing metrics:

- * **Application speedups through core library changes**:
+ * **Application speedups through core infrastructure changes**:

As seen in our [2021 OSDI paper](https://research.google/pubs/pub50370/),
"one classical approach is to increase the efficiency of an allocator to
@@ -79,21 +80,21 @@ optimizations* that are not well-represented by existing metrics:
In future hardware generations, we expect to replace calls to memcpy with
microcode-optimized `rep movsb` instructions that are faster than any
handwritten assembly sequence we can come up with. We expect `rep movsb` to
- have low IPC: It's a single instruction that replaces an entire copy loop of
- instructions!
+ have low IPC (instructions per cycle): It's a single instruction that
+ replaces an entire copy loop of instructions!

The use of these new instructions can be triggered by optimizing the source code
or through compiler enhancements that improve vectorization.

- Focusing on MIPS or IPC would cause us to prefer any implementation that
- executes a large number of instructions, even if those instructions take
- longer to execute to copy `n` bytes.
+ Focusing on MIPS (millions of instructions per second) or IPC would cause us
+ to prefer any implementation that executes a large number of instructions,
+ even if those instructions take longer to execute to copy `n` bytes.

In fact, enabling the AVX, FMA, and BMI instruction sets by compiling with
- `--march=haswell` shows a MIPS regression while simultaneously improving
- *application productivity improvement*. These instructions can do more work
- per instruction, however, replacing several low latency instructions may
- mean that *average* instruction latency increases. If we had 10 million
+ `--march=haswell` shows a MIPS regression while simultaneously *improving
+ application productivity*. These instructions can do more work per
+ instruction; however, replacing several low-latency instructions may mean
+ that *average* instruction latency increases. If we had 10 million
instructions and 10 ms per query, we may now have 8 million instructions
taking only 9 ms per query. QPS goes up while MIPS goes down.
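
To make that arithmetic concrete, here is the same example worked through in a small, hypothetical C++ program, assuming for illustration that one query is handled at a time on a single core:

```cpp
// Worked example using only the numbers quoted above: MIPS falls even though
// each query finishes sooner, i.e., productivity improves.
#include <cstdio>

int main() {
  // Before: 10 million instructions, 10 ms per query.
  const double mips_before = 10e6 / 0.010 / 1e6;  // 1000 MIPS
  const double qps_before = 1.0 / 0.010;          // 100 queries per second
  // After --march=haswell: 8 million instructions, 9 ms per query.
  const double mips_after = 8e6 / 0.009 / 1e6;    // ~889 MIPS
  const double qps_after = 1.0 / 0.009;           // ~111 queries per second
  std::printf("MIPS: %.0f -> %.0f\n", mips_before, mips_after);
  std::printf("QPS:  %.0f -> %.1f\n", qps_before, qps_after);
  return 0;
}
```

MIPS falls by about 11% while queries per CPU-second rise by about 11%: exactly the divergence an application productivity metric captures and an instruction-count metric misses.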