Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm backend fix #96

Merged
merged 81 commits into from
Jan 18, 2023
Merged

Arm backend fix #96

merged 81 commits into from
Jan 18, 2023

Conversation

XapaJIaMnu
Copy link
Collaborator

Arm backend decoupled from intgemm.

Code is reused wherever reasonable. PrepareColumnsB could probably be a bit more unified, but I decided against it. No model binarization from ARM although that wouldn't be too difficult to do (I just found it unnecessary). Supersedes #79

jerinphilip and others added 30 commits March 9, 2022 20:02
Provides an arm backend for matrix multiplies using google/ruy and math
functions through simde (https://simd-everywhere.github.io/blog/about/)
effectively getting marian-decoder to run on ARM.

The following cmake flags are added:

- USE_INTGEMM (to switch intgemm on/off)
- USE_RUY (to switch ruy on/off)
- USE_ONNX_SGEMM (use onnx sgemm added by wasm to provide attention
  matrix multiply which is currently reliant on a BLAS library).
- USE_SIMDE (swaps out existing intel based functions by using SIMDE
  instead).

The built marian-decoder is tested on an Oracle Cloud ARM Machine with
the following specs:

  Architecture   : aarch64
  CPU op-mode(s) : 32-bit, 64-bit
  Byte Order     : Little Endian
  Vendor ID      : ARM
  Model name     : Neoverse-N1
  Flags          : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
                   asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

A CI check on GitHub actions is added to use android-ndk cross-compile
targetting arm64-v8a. The built binary is tested to work on an Android
Phone using termux (Samsung M30s).

Successful android build additionally requires a patch (sentencepiece ->
protobuf). See opencv/opencv#17282 and
opencv/opencv#19049.

-Werror etc causes issues with ruy (-Wmulti-line-comment) and are
disabled.

The following minor changes are also applied:

 - Remove M32_BINARIES use COMPILE_WASM for -m32
 - Hide msse4.1 if unknown platform
 - faiss was previously hardcoded for platforms with SSE available. This
   has been mitigated by adding a refernce standard cpp implementation of
   the missing function.
 - Exclude packed_gemm_....cpp from sources if USE_FBGEMM=off
 - MSVC workaround following
   #56 (comment)
Bit ugly for now, but we are headed towards better.
There are no accompanying tests.
* Remove SIMDE dependency from integer_common.cpp:AddBias
* Typo: __ARM_NEON__
* Revert "Typo: __ARM_NEON__"
This reverts commit f29a0bb.
* Typo __ARM_NEON__: NoAutoFormatBuffer
* AVX and optional expansion, include guarded by __AVX__ instead of SIMDE
* Create an ARM_NEON structure to advance compilation without SIMDE
* Import neon header files
* No SIMDE for sse_mathfun
* Import submodule as a whole, can't do only neon
* Removing old files
* Update simdutils submodule usage
* Remove simde submodule
* More SIMDE removal
* Remove the neon_mathfun.h file
* USE_SIMD_UTILS instead of USE_SIMDE in CI
* xmmintrin.h include if SSE for WebAssembly
* Include simd_utils only if flag set
* Create a dummy float32x4 because Windows
* #else block is SSE __m128d now, old behaviour
* Windows does not give us flags, expects sse_mathfun
* Point simd_utils to a fork with an experimental patch
* Restore TODO, it's valid after the reset
- Underscore suffix for curried args.
- Make args private.

# Apple M1 has Apple Accelerate. Otherwise fallback to RUY
if(APPLE)
option(USE_RUY_SGEMM "Compile with Ruy SGEMM" OFF)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could have been a CMAKE_DEPENDENT_OPTION

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree.

add_compile_definitions(ARM)
#
else()
set(USE_INTGEMM ON)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we should be setting architecture-dependent defaults that the user can override. Right now it has default off, then arch-dependent set to on. Compiling without intgemm on x86 is a feasible option that should still be accessible to the user.

Can we delay introducing USE_INTGEMM then make it a CMAKE_DEPENDENT_OPTION?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

USE_INTGEMM was introduce by jerin Back then the ARM backend was dependent on intgemm. I can clean it up.

I am not quite sure in which scenario we would want to compile on x86 without intgemm though.

float32x4(const __m128& f) : f_(f) {}
// __m128 _mm_set1_ps(float) copies value into all slots, vdupq_n_f32 is it's
// NEON equivalent.
float32x4(const float& f) : f_(vdupq_n_f32(f)) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be explicit? I recognize it wasn't in the base version.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a situation where we could be unpleasantly surprised by implicit conversion, but I will add it to make sure.

float unquant_mult = (-1)*((127.0f / *quant_mult_a->data())*(127.0f / *quant_mult_b->data()))/(127.0f); //Minus one to invert add_ps later on
intgemm::Int8Shift::PrepareBias((const int8_t *)b->data(), rows(b), cols(b), intgemm::callbacks::UnquantizeAndWrite(unquant_mult, val_->data()));
#else
// Not sure what's going on here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe an abort here then? ARM builds would be in this else, but ARM calls should be in the ruy interface now.

Comment on lines 162 to 163
// TODO(jerin): Enable
// assert(rows % tile_size == 0 && cols & tile_size == 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to remove the comment, it's already enabled.

@XapaJIaMnu XapaJIaMnu merged commit 4b30c26 into master Jan 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants