Commit 0aba7f2

assembly guide: added "Split Data Dependency Chains"
1 parent f01ecab commit 0aba7f2

1 file changed

+99
-2
lines changed

arm64-assembly-optimization.md

@@ -92,5 +92,102 @@ different types next to each other to take advantage of instruction level
parallelism, ILP. For example, interleaving load instructions with vector or
floating point instructions can keep both pipelines busy.

-Next we will discuss breaking data dependencies as a technique for improving performance. Check back
-again soon!

## Split Data Dependency Chains

As we saw in the last section, Graviton has multiple pipelines, or execution units, that can execute instructions. Instructions can execute in parallel once all of their input dependencies have been met, but a series of instructions in which each depends on the result of the previous one cannot make efficient use of the CPU's resources.

For example, a simple C function that takes 64 signed 8-bit integers and adds them all up into one value could be implemented like this:

```
#include <stdint.h>

int16_t add_64(int8_t *d) {
    int16_t sum = 0;
    for (int i = 0; i < 64; i++) {
        sum += d[i];
    }
    return sum;
}
```

We could write this in assembly using NEON SIMD instructions like this:

```
add_64_neon_01:
    ld1     {v0.16b, v1.16b, v2.16b, v3.16b}, [x0]  // load all 64 bytes into vector registers v0-v3
    movi    v4.2d, #0                               // zero v4 for use as an accumulator
    saddw   v4.8h, v4.8h, v0.8b                     // add bytes 0-7 of v0 to the accumulator
    saddw2  v4.8h, v4.8h, v0.16b                    // add bytes 8-15 of v0 to the accumulator
    saddw   v4.8h, v4.8h, v1.8b                     // ...
    saddw2  v4.8h, v4.8h, v1.16b
    saddw   v4.8h, v4.8h, v2.8b
    saddw2  v4.8h, v4.8h, v2.16b
    saddw   v4.8h, v4.8h, v3.8b
    saddw2  v4.8h, v4.8h, v3.16b
    addv    h0, v4.8h                               // horizontal add all values in the accumulator to h0
    fmov    w0, h0                                  // copy vector register h0 to general purpose register w0
    ret                                             // return with the result in w0
```

In this example, we use the signed add-wide instructions `saddw` and `saddw2`, which add the bottom and top halves of each source register to `v4`, the register used to accumulate the sum. This is the worst case for data dependency chains: every instruction depends on the result of the previous one, so the CPU cannot achieve any instruction level parallelism. If we use `llvm-mca` to evaluate it, we can see this clearly.

```
Timeline view:
                    0123456789
Index     0123456789          0123456

[0,0]     DeeER.    .    .    .    ..   movi v4.2d, #0000000000000000
[0,1]     D==eeER   .    .    .    ..   saddw v4.8h, v4.8h, v0.8b
[0,2]     D====eeER .    .    .    ..   saddw2 v4.8h, v4.8h, v0.16b
[0,3]     D======eeER    .    .    ..   saddw v4.8h, v4.8h, v1.8b
[0,4]     D========eeER  .    .    ..   saddw2 v4.8h, v4.8h, v1.16b
[0,5]     D==========eeER.    .    ..   saddw v4.8h, v4.8h, v2.8b
[0,6]     D============eeER   .    ..   saddw2 v4.8h, v4.8h, v2.16b
[0,7]     D==============eeER .    ..   saddw v4.8h, v4.8h, v3.8b
[0,8]     D================eeER    ..   saddw2 v4.8h, v4.8h, v3.16b
[0,9]     D==================eeeeER..   addv h0, v4.8h
[0,10]    D======================eeER   fmov w0, h0
[0,11]    DeE-----------------------R   ret
```
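To make the serial chain concrete, here is a minimal scalar C model of what the instruction sequence above computes. It is only a sketch: the function name `add_64_lane_model` and the `acc` array are hypothetical stand-ins for the routine and for the eight 16-bit lanes of `v4`, and plain C says nothing about how the hardware schedules the work. Each pass of the outer loop corresponds to one `saddw` or `saddw2`, and because every pass reads the lanes written by the previous pass, the eight passes cannot overlap.

```c
#include <stdint.h>

/* Scalar model of add_64_neon_01 (illustrative only; the name is hypothetical). */
int16_t add_64_lane_model(const int8_t *d) {
    int16_t acc[8] = {0};   /* models the eight 16-bit lanes of v4 after movi */

    /* Eight widening adds, one per 8-byte half of v0-v3. Pass k reads the
       acc[] values written by pass k-1, so the passes form one serial chain. */
    for (int k = 0; k < 8; k++) {
        for (int lane = 0; lane < 8; lane++) {
            acc[lane] = (int16_t)(acc[lane] + d[8 * k + lane]);
        }
    }

    /* Models addv: fold the eight lanes into a single value. */
    int16_t sum = 0;
    for (int lane = 0; lane < 8; lane++) {
        sum = (int16_t)(sum + acc[lane]);
    }
    return sum;
}
```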
One way to break data dependency chains is to use the commutative and associative properties of addition to change the order in which the adds are completed. Consider the following alternative implementation, which makes use of pairwise add instructions.

```
add_64_neon_02:
    ld1     {v0.16b, v1.16b, v2.16b, v3.16b}, [x0]

    saddlp  v0.8h, v0.16b           // signed add pairwise long
    saddlp  v1.8h, v1.16b
    saddlp  v2.8h, v2.16b
    saddlp  v3.8h, v3.16b           // after this instruction we have 32 16-bit values

    addp    v0.8h, v0.8h, v1.8h     // add pairwise again v0 and v1
    addp    v2.8h, v2.8h, v3.8h     // now we are down to 16 16-bit values

    addp    v0.8h, v0.8h, v2.8h     // 8 16-bit values
    addv    h0, v0.8h               // add 8 remaining values across vector
    fmov    w0, h0

    ret
```

In this example, the first 4 instructions after the load are independent of each other, and so are the next 2. The last 3 instructions, however, do have data dependencies on each other. If we take a look with `llvm-mca` again, we can see that this implementation takes 10 cycles (excluding the initial load instruction common to both implementations), while the original takes 27 cycles. Note that the `addv` in this timeline reads `v4` rather than the `v0` produced by the final `addp`, so it is shown issuing early. When `addv` reads `v0`, as the reduction requires, it has to wait for that `addp`, so the total should be somewhat higher than 10 cycles while remaining well below 27.

```
Timeline view:
Index     0123456789

[0,0]     DeeER.   .   saddlp v0.8h, v0.16b
[0,1]     DeeER.   .   saddlp v1.8h, v1.16b
[0,2]     DeeER.   .   saddlp v2.8h, v2.16b
[0,3]     DeeER.   .   saddlp v3.8h, v3.16b
[0,4]     D==eeER  .   addp v0.8h, v0.8h, v1.8h
[0,5]     D==eeER  .   addp v2.8h, v2.8h, v3.8h
[0,6]     D====eeER.   addp v0.8h, v0.8h, v2.8h
[0,7]     D=eeeeE-R.   addv h0, v4.8h
[0,8]     D=====eeER   fmov w0, h0
[0,9]     DeE------R   ret
```
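The same reordering can be modeled in scalar C. This is a sketch under assumptions, not the NEON routine itself: `add_64_tree_model` is a hypothetical name, and the model keeps halving until one value remains, whereas the assembly above stops at 8 values and finishes with `addv`. The adds within one stage do not depend on each other, so the longest dependency chain is 6 stages deep rather than 64 adds long.

```c
#include <stdint.h>

/* Scalar model of a pairwise (tree) reduction; the name is hypothetical. */
int16_t add_64_tree_model(const int8_t *d) {
    int16_t v[32];

    /* Stage 1 models saddlp: widen and add adjacent bytes, 64 -> 32 values. */
    for (int i = 0; i < 32; i++) {
        v[i] = (int16_t)(d[2 * i] + d[2 * i + 1]);
    }

    /* Later stages model addp: add adjacent pairs, halving the count each
       time. Every add in a stage reads only values from the previous stage,
       so the adds within a stage are independent of each other. */
    for (int n = 32; n > 1; n /= 2) {
        for (int i = 0; i < n / 2; i++) {
            v[i] = (int16_t)(v[2 * i] + v[2 * i + 1]);
        }
    }
    return v[0];
}
```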
Next we will discuss modulo scheduling. Check back again soon!
