Aarch64 performance: `vld1q_u8` intrinsic can cause single-byte loads #1148
Comments
- The fixed loop of four 128-bit chunks was not automatically unrolled. It is hand-unrolled now. This does not change the assembly output on x64.
- The [vld1q_u8](https://doc.rust-lang.org/stable/core/arch/aarch64/fn.vld1q_u8.html) intrinsic is broken. The compiler thinks it can "optimize" loads by loading bytes individually if a SIMD shuffle instruction follows. According to [the ARM docs](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vld1q_u8) it should be coded as one instruction. This had an effect on the code when the loop was manually unrolled. Workaround: see code. Issue filed: rust-lang/stdarch#1148
Comparing to the IR generated by Clang, the real issue is that we should be calling cc @SparrowLii
Ah but that's currently blocked on #1143, which is a rustc limitation on returning tuples from intrinsics.
In general, these implementations will be optimized by the compiler to a
That is really interesting... However with clang it does not break up the loads when the vector register is actually used: In contrast to Rust: That means it is likely not the fault of the
While adding aarch64 support to simdutf8 I encountered an unexpected eightfold slowdown when hand-unrolling a loop. The slowdown was caused by the compiler suddenly deciding to load 128-bit `uint8x16_t` values with single-byte load instructions instead of 128-bit loads.

It turns out that the `vld1q_u8` intrinsic is at fault. The code generator thinks it can "optimize" loads by loading bytes individually if a SIMD shuffle instruction follows. According to the ARM docs this intrinsic should always be coded as one instruction. I worked around it by doing the load in a way similar to how it is currently done for SSE2.

Testcase and proposed fix on Godbolt

The same issue likely applies to the other `vld1q` intrinsics.
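
For readers without the Godbolt link at hand, here is a minimal sketch of what an SSE2-style load could look like, assuming it amounts to a plain unaligned pointer read instead of a call to `vld1q_u8`. The helper names `load_unaligned` and `load_u8x16` are hypothetical and are not taken from simdutf8 or stdarch. Whether this actually avoids the byte-wise loads depends on the compiler version and should be checked in the generated assembly.

```rust
use core::ptr;

/// Hypothetical helper: reads a `T` from a possibly unaligned byte pointer.
///
/// # Safety
/// `src` must be valid for reading `size_of::<T>()` bytes, and every bit
/// pattern of that size must be a valid `T` (true for byte arrays and
/// SIMD vector types).
pub unsafe fn load_unaligned<T: Copy>(src: *const u8) -> T {
    // A single typed read; the compiler is free to lower it to one wide load.
    unsafe { ptr::read_unaligned(src as *const T) }
}

/// Hypothetical aarch64 wrapper standing in for `vld1q_u8` (assumption:
/// this mirrors the workaround described in the issue, not its exact code).
///
/// # Safety
/// `src` must be valid for reading 16 bytes.
#[cfg(target_arch = "aarch64")]
pub unsafe fn load_u8x16(src: *const u8) -> core::arch::aarch64::uint8x16_t {
    unsafe { load_unaligned(src) }
}

/// Portable usage example: load 16 bytes starting at an odd offset.
pub fn load_chunk_at_offset_one(bytes: &[u8; 17]) -> [u8; 16] {
    // SAFETY: bytes[1..17] is exactly 16 readable bytes.
    unsafe { load_unaligned(bytes.as_ptr().add(1)) }
}
```

On aarch64, `load_u8x16(ptr)` would be used wherever the original code called `vld1q_u8(ptr)`; the portable `[u8; 16]` variant exists only so the sketch can be exercised on any target.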