-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arm backend fix #96
Arm backend fix #96
Conversation
Provides an arm backend for matrix multiplies using google/ruy and math functions through simde (https://simd-everywhere.github.io/blog/about/) effectively getting marian-decoder to run on ARM. The following cmake flags are added: - USE_INTGEMM (to switch intgemm on/off) - USE_RUY (to switch ruy on/off) - USE_ONNX_SGEMM (use onnx sgemm added by wasm to provide attention matrix multiply which is currently reliant on a BLAS library). - USE_SIMDE (swaps out existing intel based functions by using SIMDE instead). The built marian-decoder is tested on an Oracle Cloud ARM Machine with the following specs: Architecture : aarch64 CPU op-mode(s) : 32-bit, 64-bit Byte Order : Little Endian Vendor ID : ARM Model name : Neoverse-N1 Flags : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs A CI check on GitHub actions is added to use android-ndk cross-compile targetting arm64-v8a. The built binary is tested to work on an Android Phone using termux (Samsung M30s). Successful android build additionally requires a patch (sentencepiece -> protobuf). See opencv/opencv#17282 and opencv/opencv#19049. -Werror etc causes issues with ruy (-Wmulti-line-comment) and are disabled. The following minor changes are also applied: - Remove M32_BINARIES use COMPILE_WASM for -m32 - Hide msse4.1 if unknown platform - faiss was previously hardcoded for platforms with SSE available. This has been mitigated by adding a refernce standard cpp implementation of the missing function. - Exclude packed_gemm_....cpp from sources if USE_FBGEMM=off - MSVC workaround following #56 (comment)
Bit ugly for now, but we are headed towards better.
…-march=native" This reverts commit 3cf85f7.
There are no accompanying tests.
* Remove SIMDE dependency from integer_common.cpp:AddBias * Typo: __ARM_NEON__ * Revert "Typo: __ARM_NEON__" This reverts commit f29a0bb. * Typo __ARM_NEON__: NoAutoFormatBuffer * AVX and optional expansion, include guarded by __AVX__ instead of SIMDE * Create an ARM_NEON structure to advance compilation without SIMDE * Import neon header files * No SIMDE for sse_mathfun * Import submodule as a whole, can't do only neon * Removing old files * Update simdutils submodule usage * Remove simde submodule * More SIMDE removal * Remove the neon_mathfun.h file * USE_SIMD_UTILS instead of USE_SIMDE in CI * xmmintrin.h include if SSE for WebAssembly * Include simd_utils only if flag set * Create a dummy float32x4 because Windows * #else block is SSE __m128d now, old behaviour * Windows does not give us flags, expects sse_mathfun * Point simd_utils to a fork with an experimental patch * Restore TODO, it's valid after the reset
- Underscore suffix for curried args. - Make args private.
|
||
# Apple M1 has Apple Accelerate. Otherwise fallback to RUY | ||
if(APPLE) | ||
option(USE_RUY_SGEMM "Compile with Ruy SGEMM" OFF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could have been a CMAKE_DEPENDENT_OPTION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree.
add_compile_definitions(ARM) | ||
# | ||
else() | ||
set(USE_INTGEMM ON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we should be setting architecture-dependent defaults that the user can override. Right now it has default off, then arch-dependent set to on. Compiling without intgemm on x86 is a feasible option that should still be accessible to the user.
Can we delay introducing USE_INTGEMM
then make it a CMAKE_DEPENDENT_OPTION?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
USE_INTGEMM
was introduce by jerin Back then the ARM backend was dependent on intgemm. I can clean it up.
I am not quite sure in which scenario we would want to compile on x86 without intgemm though.
float32x4(const __m128& f) : f_(f) {} | ||
// __m128 _mm_set1_ps(float) copies value into all slots, vdupq_n_f32 is it's | ||
// NEON equivalent. | ||
float32x4(const float& f) : f_(vdupq_n_f32(f)) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be explicit? I recognize it wasn't in the base version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a situation where we could be unpleasantly surprised by implicit conversion, but I will add it to make sure.
src/tensors/cpu/intgemm_interface.h
Outdated
float unquant_mult = (-1)*((127.0f / *quant_mult_a->data())*(127.0f / *quant_mult_b->data()))/(127.0f); //Minus one to invert add_ps later on | ||
intgemm::Int8Shift::PrepareBias((const int8_t *)b->data(), rows(b), cols(b), intgemm::callbacks::UnquantizeAndWrite(unquant_mult, val_->data())); | ||
#else | ||
// Not sure what's going on here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe an abort here then? ARM builds would be in this else, but ARM calls should be in the ruy interface now.
src/tensors/cpu/ruy_interface.h
Outdated
// TODO(jerin): Enable | ||
// assert(rows % tile_size == 0 && cols & tile_size == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to remove the comment, it's already enabled.
Arm backend decoupled from intgemm.
Code is reused wherever reasonable. PrepareColumnsB could probably be a bit more unified, but I decided against it. No model binarization from ARM although that wouldn't be too difficult to do (I just found it unnecessary). Supersedes #79