
Aarch64 performance: vld1q_u8 intrinsic can cause single-byte loads #1148

Closed
hkratz opened this issue May 2, 2021 · 4 comments · Fixed by #1207

Comments

@hkratz
Contributor

hkratz commented May 2, 2021

While adding aarch64 support to simdutf8 I encountered an unexpected eightfold slowdown when hand-unrolling a loop. The slowdown turned out to be caused by the compiler suddenly loading 128-bit uint8x16_t values with single-byte load instructions instead of 128-bit loads.

It turns out that the vld1q_u8 intrinsic is at fault. The code generator thinks it can "optimize" loads by loading bytes individually if a SIMD shuffle instruction follows. According to the ARM docs this intrinsic should always be compiled to a single instruction. I fixed it by doing the load similarly to how it is currently done for SSE2.

Testcase and proposed fix on Godbolt

The same issue likely applies to the other vld1q intrinsics.
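
For illustration, a minimal sketch of that kind of SSE2-style workaround (hypothetical helper name; the exact code is in the Godbolt link above):

```rust
use core::arch::aarch64::{uint8x16_t, vdupq_n_u8};

// Hypothetical helper, not the exact simdutf8 fix: do the load the way the
// SSE2 _mm_loadu_si128 implementation does, via a byte-wise copy into a
// vector value, intended to keep LLVM from splitting it into per-byte loads.
#[target_feature(enable = "neon")]
unsafe fn load_u8x16(addr: *const u8) -> uint8x16_t {
    let mut dst = vdupq_n_u8(0);
    core::ptr::copy_nonoverlapping(addr, &mut dst as *mut uint8x16_t as *mut u8, 16);
    dst
}
```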

hkratz added a commit to rusticstuff/simdutf8 that referenced this issue May 2, 2021
- The fixed loop of four 128-bit chunks was not automatically unrolled. It is hand-unrolled now. This does not change the assembly output on x64.
- The [vld1q_u8](https://doc.rust-lang.org/stable/core/arch/aarch64/fn.vld1q_u8.html) intrinsic is broken. The compiler thinks it can "optimize" loads by loading bytes individually if a SIMD shuffle instruction follows. According to [the ARM docs](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vld1q_u8) it should be coded as one instruction. This had an effect on the code when the loop was manually unrolled. Workaround: see code.
  
  Issue filed: rust-lang/stdarch#1148
@Amanieu
Member

Amanieu commented May 2, 2021

Comparing to the IR generated by Clang, the real issue is that we should be calling llvm.aarch64.neon.ld1x4.v16i8.p0i8 instead of doing the loads manually.

cc @SparrowLii

@Amanieu
Member

Amanieu commented May 2, 2021

Ah but that's currently blocked on #1143, which is a rustc limitation on returning tuples from intrinsics.

@SparrowLii
Member

SparrowLii commented May 4, 2021

In general, these implementations are optimized by the compiler to an ldr instruction:
godbolt
In Clang, the implementation also directly uses the load instruction:
godbolt
Although I am not familiar with the reasons behind the compiler's optimization, given the clear improvement in the scenarios shown here I think it is reasonable to change the implementation of the vld1* intrinsics. For our purposes, these intrinsics do not need to call llvm.aarch64.neon.*.
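
As a rough sketch of the kind of plain pointer-read implementation being discussed (an approximation, not the exact stdarch source):

```rust
use core::arch::aarch64::uint8x16_t;
use core::ptr::read_unaligned;

// Approximation of an implementation that does not go through an
// llvm.aarch64.neon.* intrinsic; compiled in isolation this lowers to a
// single 128-bit `ldr`.
#[target_feature(enable = "neon")]
unsafe fn vld1q_u8_sketch(addr: *const u8) -> uint8x16_t {
    read_unaligned(addr as *const uint8x16_t)
}
```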

@hkratz
Contributor Author

hkratz commented May 6, 2021

In Clang, the implementation also directly uses the load instruction:
godbolt

That is really interesting...

However, Clang does not break up the loads when the vector register is actually used:
https://godbolt.org/z/ozMev64sz

In contrast to Rust:
https://godbolt.org/z/rj1zv8PjW

That means it is likely not the fault of the vld1q_u8 intrinsic at all.
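
For reference, a rough sketch of the load-plus-shuffle pattern being compared (a hypothetical example, not the exact Godbolt test case):

```rust
use core::arch::aarch64::{uint8x16_t, vld1q_u8, vqtbl1q_u8};

// Hypothetical example: a 128-bit load followed by a table lookup (shuffle).
// Clang keeps the load as a single `ldr`; the Rust/LLVM output discussed
// above splits it into per-byte loads.
#[target_feature(enable = "neon")]
unsafe fn load_and_lookup(input: *const u8, table: uint8x16_t) -> uint8x16_t {
    let v = vld1q_u8(input);
    vqtbl1q_u8(table, v)
}
```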
