Aarch64 performance fixes #33

hkratz · 2021-05-01T16:50:15Z

Mainly two things:

The fixed loop over four 128-bit chunks was not unrolled automatically on aarch64, so it is now unrolled by hand. This does not change the assembly output on x64.
The vld1q_u8 intrinsic is broken: the compiler thinks it can "optimize" the load by loading bytes individually when a SIMD shuffle instruction follows. According to the ARM docs, the intrinsic should compile to a single instruction. The problem affected the generated code once the loop was manually unrolled. Workaround: see code.
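To illustrate the first point, here is a minimal, portable sketch of what hand-unrolling a fixed four-chunk loop can look like. It is not the code from this PR: `Chunk`, `or`, `or_reduce_loop` and `or_reduce_unrolled` are hypothetical names, and a plain `[u8; 16]` stands in for a 128-bit NEON register so the example builds on any target.

```rust
/// Hypothetical stand-in for one 128-bit SIMD register (16 bytes).
type Chunk = [u8; 16];

/// Lane-wise OR of two chunks (a real implementation would use a SIMD intrinsic).
fn or(a: Chunk, b: Chunk) -> Chunk {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = a[i] | b[i];
    }
    out
}

/// Loop form: whether the compiler unrolls this is up to the optimizer.
fn or_reduce_loop(block: &[Chunk; 4]) -> Chunk {
    let mut acc = [0u8; 16];
    for chunk in block.iter() {
        acc = or(acc, *chunk);
    }
    acc
}

/// Hand-unrolled form: the four chunks are spelled out explicitly,
/// so the result no longer depends on automatic loop unrolling.
fn or_reduce_unrolled(block: &[Chunk; 4]) -> Chunk {
    let lo = or(block[0], block[1]);
    let hi = or(block[2], block[3]);
    or(lo, hi)
}
```

Both functions compute the same value; only the shape handed to the optimizer differs. Whether the unrolled form is faster depends on the target and compiler version, so it is worth comparing the assembly output, as was done for x64 above.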

Issue filed: "Aarch64 performance: vld1q_u8 intrinsic can cause single-byte loads" (rust-lang/stdarch#1148)
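The description only says "see code" for the workaround, so the following is an assumed shape, not a copy of the PR: one way to avoid calling a load intrinsic is to copy all 16 bytes into a `MaybeUninit` value and only then call `assume_init()`. `Chunk` and `load_chunk` are hypothetical names, and `Chunk` stands in for `uint8x16_t` so the sketch builds on any target. Whether a given compiler turns this into a single 128-bit load has to be checked in the generated assembly.

```rust
/// Hypothetical stand-in for the NEON register type `uint8x16_t`.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C, align(16))]
struct Chunk([u8; 16]);

/// Loads 16 bytes from `src` as one unit.
///
/// # Safety
/// `src` must be valid for reads of 16 bytes. It does not need to be aligned.
unsafe fn load_chunk(src: *const u8) -> Chunk {
    let mut dst = std::mem::MaybeUninit::<Chunk>::uninit();
    unsafe {
        // Initialize every byte of `dst` first ...
        std::ptr::copy_nonoverlapping(
            src,
            dst.as_mut_ptr() as *mut u8,
            std::mem::size_of::<Chunk>(),
        );
        // ... and only then claim it is initialized. Calling `assume_init()`
        // before the copy would be undefined behavior.
        dst.assume_init()
    }
}
```

The order matters: `assume_init()` on a value that has not been fully written is undefined behavior even if the bytes are overwritten immediately afterwards.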

src/implementation/aarch64/neon.rs

hkratz added 10 commits May 1, 2021 17:30

Workaround for broken vld1q_u8 intrinsic

559893e

Unroll loop manually because Rust does not on ARM64

e8a78ee

silence incorrect error msg

20f802f

remove extra parens

619935b

silence Rust 1.38.0 warning

8148bb7

Workaround for broken vld1q_u8 intrinsic

ca5a803

Unroll loop manually because Rust does not on ARM64

17d3894

silence incorrect error msg

d390026

remove extra parens

5990e5e

silence Rust 1.38.0 warning

38d84ab

hkratz force-pushed the aarch64_perf_fixes branch from 8148bb7 to 38d84ab May 1, 2021 18:11

hkratz added 2 commits May 1, 2021 21:57

Merge branch 'aarch64_perf_fixes' of github.com:rusticstuff/simdutf8 …

03bd47f

…into aarch64_perf_fixes

Merge branch 'main' into aarch64_perf_fixes

b854bd9

ArniDagur reviewed May 1, 2021

src/implementation/aarch64/neon.rs

hkratz added 2 commits May 2, 2021 05:56

Fix UB: first init, then assume_init() - same asm output

64ae26f

Merge branch 'aarch64_perf_fixes' of github.com:rusticstuff/simdutf8 …

2c549d0

…into aarch64_perf_fixes

hkratz merged commit 2e59822 into main May 2, 2021

hkratz deleted the aarch64_perf_fixes branch May 2, 2021 07:14
