
Aarch64 performance: vld1q_u8 intrinsic can cause single-byte loads #1148

Closed
hkratz opened this issue May 2, 2021 · 4 comments · Fixed by #1207

Comments

@hkratz
Contributor

hkratz commented May 2, 2021

While adding aarch64 support to simdutf8 I encountered an unexpected eightfold slowdown when hand-unrolling a loop. The slowdown turned out to be caused by the compiler suddenly loading 128-bit uint8x16_t values with single-byte load instructions instead of 128-bit loads.

It turns out that the vld1q_u8 intrinsic is at fault. The code generator thinks it can "optimize" loads by loading bytes individually if a SIMD shuffle instruction follows. According to the ARM docs this intrinsic should always be compiled to a single instruction. I fixed it by doing the load similarly to how it is currently done for SSE2.

Testcase and proposed fix on Godbolt

The same issue likely applies to the other vld1q intrinsics.
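
For illustration, a minimal sketch of that kind of SSE2-style workaround (hypothetical helper name; the exact code is in the Godbolt link above):

```rust
use core::arch::aarch64::{uint8x16_t, vdupq_n_u8};

// Hypothetical helper, not the exact simdutf8 fix: do the load the way the
// SSE2 _mm_loadu_si128 implementation does, via a byte-wise copy into a
// vector value, intended to keep LLVM from splitting it into per-byte loads.
#[target_feature(enable = "neon")]
unsafe fn load_u8x16(addr: *const u8) -> uint8x16_t {
    let mut dst = vdupq_n_u8(0);
    core::ptr::copy_nonoverlapping(addr, &mut dst as *mut uint8x16_t as *mut u8, 16);
    dst
}
```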

hkratz added a commit to rusticstuff/simdutf8 that referenced this issue May 2, 2021
- The fixed loop of four 128-bit chunks was not automatically unrolled. It is hand-unrolled now. This does not change the assembly output on x64.
- The [vld1q_u8](https://doc.rust-lang.org/stable/core/arch/aarch64/fn.vld1q_u8.html) intrinsic is broken. The compiler thinks it can "optimize" loads by loading bytes individually if a SIMD shuffle instruction follows. According to [the ARM docs](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vld1q_u8) it should be coded as one instruction. This had an effect on the code when the loop was manually unrolled. Workaround: see code.
  
  Issue filed: rust-lang/stdarch#1148
@Amanieu
Member

Amanieu commented May 2, 2021

Comparing to the IR generated by Clang, the real issue is that we should be calling llvm.aarch64.neon.ld1x4.v16i8.p0i8 instead of doing the loads manually.

cc @SparrowLii

@Amanieu
Member

Amanieu commented May 2, 2021

Ah but that's currently blocked on #1143, which is a rustc limitation on returning tuples from intrinsics.

@SparrowLii
Member

SparrowLii commented May 4, 2021

In general, these implementations are optimized by the compiler to an ldr instruction:
godbolt
In Clang, the implementation also directly uses the load instruction:
godbolt
Although I am not familiar with the reasons behind the compiler's optimization, given the clear improvement in the scenarios shown here I think it is reasonable to change the implementation of the vld1* intrinsics. For our purposes, these intrinsics do not need to call llvm.aarch64.neon.*.
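
As a rough sketch of the kind of plain pointer-read implementation being discussed (an approximation, not the exact stdarch source):

```rust
use core::arch::aarch64::uint8x16_t;
use core::ptr::read_unaligned;

// Approximation of an implementation that does not go through an
// llvm.aarch64.neon.* intrinsic; compiled in isolation this lowers to a
// single 128-bit `ldr`.
#[target_feature(enable = "neon")]
unsafe fn vld1q_u8_sketch(addr: *const u8) -> uint8x16_t {
    read_unaligned(addr as *const uint8x16_t)
}
```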

@hkratz
Contributor Author

hkratz commented May 6, 2021

In Clang, the implementation also directly uses the load instruction:
godbolt

That is really interesting...

However, Clang does not break up the loads when the vector register is actually used:
https://godbolt.org/z/ozMev64sz

In contrast to Rust:
https://godbolt.org/z/rj1zv8PjW

That means it is likely not the fault of the vld1q_u8 intrinsic at all.
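
For reference, a rough sketch of the load-plus-shuffle pattern being compared (a hypothetical example, not the exact Godbolt test case):

```rust
use core::arch::aarch64::{uint8x16_t, vld1q_u8, vqtbl1q_u8};

// Hypothetical example: a 128-bit load followed by a table lookup (shuffle).
// Clang keeps the load as a single `ldr`; the Rust/LLVM output discussed
// above splits it into per-byte loads.
#[target_feature(enable = "neon")]
unsafe fn load_and_lookup(input: *const u8, table: uint8x16_t) -> uint8x16_t {
    let v = vld1q_u8(input);
    vqtbl1q_u8(table, v)
}
```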
