Arm backend fix #96

XapaJIaMnu · 2022-09-27T14:20:36Z

Arm backend decoupled from intgemm.

Code is reused wherever reasonable. PrepareColumnsB could probably be unified a bit more, but I decided against it. Model binarization is not supported from ARM, although that wouldn't be too difficult to add (I just found it unnecessary). Supersedes #79.

Provides an ARM backend for matrix multiplies using google/ruy and math functions through simde (https://simd-everywhere.github.io/blog/about/), effectively getting marian-decoder to run on ARM.

The following cmake flags are added:

- USE_INTGEMM (to switch intgemm on/off)
- USE_RUY (to switch ruy on/off)
- USE_ONNX_SGEMM (use the onnx sgemm added by wasm to provide the attention matrix multiply, which currently relies on a BLAS library)
- USE_SIMDE (swaps out the existing Intel-based functions by using SIMDE instead)

The built marian-decoder is tested on an Oracle Cloud ARM machine with the following specs:

Architecture : aarch64
CPU op-mode(s) : 32-bit, 64-bit
Byte Order : Little Endian
Vendor ID : ARM
Model name : Neoverse-N1
Flags : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

A CI check on GitHub Actions is added that uses android-ndk to cross-compile, targeting arm64-v8a. The built binary is tested to work on an Android phone using termux (Samsung M30s). A successful Android build additionally requires a patch (sentencepiece -> protobuf). See opencv/opencv#17282 and opencv/opencv#19049.

-Werror etc. cause issues with ruy (-Wmulti-line-comment) and are disabled.

The following minor changes are also applied:

- Remove M32_BINARIES; use COMPILE_WASM for -m32
- Hide msse4.1 if unknown platform
- faiss was previously hardcoded for platforms with SSE available. This has been mitigated by adding a reference standard cpp implementation of the missing function.
- Exclude packed_gemm_....cpp from sources if USE_FBGEMM=off
- MSVC workaround following #56 (comment)

* Remove SIMDE dependency from integer_common.cpp:AddBias
* Typo: __ARM_NEON__
* Revert "Typo: __ARM_NEON__" This reverts commit f29a0bb.
* Typo __ARM_NEON__: NoAutoFormatBuffer
* AVX and optional expansion, include guarded by __AVX__ instead of SIMDE
* Create an ARM_NEON structure to advance compilation without SIMDE
* Import neon header files
* No SIMDE for sse_mathfun
* Import submodule as a whole, can't do only neon
* Removing old files
* Update simdutils submodule usage
* Remove simde submodule
* More SIMDE removal
* Remove the neon_mathfun.h file
* USE_SIMD_UTILS instead of USE_SIMDE in CI
* xmmintrin.h include if SSE for WebAssembly
* Include simd_utils only if flag set
* Create a dummy float32x4 because Windows
* #else block is SSE __m128d now, old behaviour
* Windows does not give us flags, expects sse_mathfun
* Point simd_utils to a fork with an experimental patch
* Restore TODO, it's valid after the reset

kpu · 2022-10-03T07:53:09Z

CMakeLists.txt

+
+  # Apple M1 has Apple Accelerate. Otherwise fallback to RUY
+  if(APPLE)
+    option(USE_RUY_SGEMM "Compile with Ruy SGEMM" OFF)


This could have been a CMAKE_DEPENDENT_OPTION

kpu · 2022-10-03T07:55:25Z

CMakeLists.txt

+  add_compile_definitions(ARM)
+  #
+else()
+  set(USE_INTGEMM ON)


Ideally we should be setting architecture-dependent defaults that the user can override. Right now the option defaults to OFF and is then forced ON depending on the architecture. Compiling without intgemm on x86 is a feasible option that should still be accessible to the user.

Can we delay introducing USE_INTGEMM then make it a CMAKE_DEPENDENT_OPTION?

USE_INTGEMM was introduced by jerin, back when the ARM backend was dependent on intgemm. I can clean it up.

I am not quite sure in which scenario we would want to compile on x86 without intgemm though.

kpu · 2022-10-03T08:12:45Z

src/common/types.h

+  float32x4(const __m128& f) : f_(f) {}
+  // __m128 _mm_set1_ps(float) copies value into all slots, vdupq_n_f32 is it's
+  // NEON equivalent.
+  float32x4(const float& f) : f_(vdupq_n_f32(f)) {} 


Should this be explicit? I recognize it wasn't in the base version.

I don't see a situation where we could be unpleasantly surprised by implicit conversion, but I will add it to make sure.
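For illustration, a minimal portable sketch of what marking the single-argument constructor explicit buys. `Float4`, `Register`, `broadcast` and `store` are hypothetical names and not the `float32x4` from src/common/types.h; the only real calls are the intrinsics `vdupq_n_f32`/`vst1q_f32` (NEON) and `_mm_set1_ps`/`_mm_storeu_ps` (SSE). The scalar fallback is an assumption added so the sketch builds anywhere.

```cpp
#include <array>
#include <cassert>
#include <type_traits>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
using Register = float32x4_t;
// NEON: copy one float into all four lanes.
inline Register broadcast(float x) { return vdupq_n_f32(x); }
inline void store(float* out, Register r) { vst1q_f32(out, r); }
#elif defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
using Register = __m128;
// SSE: copy one float into all four lanes.
inline Register broadcast(float x) { return _mm_set1_ps(x); }
inline void store(float* out, Register r) { _mm_storeu_ps(out, r); }
#else
// Assumed scalar fallback for platforms with neither NEON nor SSE.
using Register = std::array<float, 4>;
inline Register broadcast(float x) { Register r; r.fill(x); return r; }
inline void store(float* out, Register r) { for (int i = 0; i < 4; ++i) out[i] = r[i]; }
#endif

// Hypothetical wrapper around a 4-lane float register.
class Float4 {
public:
  // explicit: a bare float never silently turns into a 4-lane vector.
  explicit Float4(float f) : f_(broadcast(f)) {}

  // Copy the lanes out so they can be inspected.
  std::array<float, 4> lanes() const {
    std::array<float, 4> out{};
    store(out.data(), f_);
    return out;
  }

private:
  Register f_;
};

// Direct construction works, implicit conversion does not.
static_assert(std::is_constructible<Float4, float>::value, "Float4(float) must exist");
static_assert(!std::is_convertible<float, Float4>::value, "float must not convert implicitly");
```

With this, `Float4 v(2.5f);` compiles while `Float4 w = 2.5f;` is rejected, so a scalar passed where a vector is expected becomes a compile error instead of a silent broadcast.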

graemenail · 2022-10-07T09:17:46Z

src/tensors/cpu/intgemm_interface.h

    float unquant_mult = (-1)*((127.0f / *quant_mult_a->data())*(127.0f / *quant_mult_b->data()))/(127.0f); //Minus one to invert add_ps later on
    intgemm::Int8Shift::PrepareBias((const int8_t *)b->data(), rows(b), cols(b), intgemm::callbacks::UnquantizeAndWrite(unquant_mult, val_->data()));
+  #else
+    // Not sure what's going on here. 


Maybe an abort here then? ARM builds would be in this else, but ARM calls should be in the ruy interface now.
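A minimal sketch of the fail-loudly idea, assuming a hypothetical runtime `Backend` tag in place of the preprocessor switch in intgemm_interface.h. It throws `std::logic_error` rather than aborting so that it can be exercised in a test; `prepareBiasOrFail` is an illustrative name, not a function from the codebase.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical backend tag; the real code selects the path at compile time.
enum class Backend { Intgemm, Ruy };

// Returns true when the intgemm path is taken. Any other backend reaching
// this point indicates a dispatch bug, so it fails loudly instead of
// silently doing nothing.
inline bool prepareBiasOrFail(Backend backend) {
  if (backend == Backend::Intgemm) {
    // The real code would call intgemm::Int8Shift::PrepareBias here
    // (not reproduced in this sketch).
    return true;
  }
  throw std::logic_error(
      "PrepareBias reached without intgemm; ARM calls should go through the ruy interface");
}
```

The design point is only that the fallthrough branch reports the misuse at the call site rather than leaving the output buffer untouched.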

graemenail · 2022-10-07T09:57:26Z

src/tensors/cpu/ruy_interface.h

+  // TODO(jerin): Enable
+  // assert(rows % tile_size == 0 && cols & tile_size == 0);


Need to remove the comment; it's already enabled.
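As a side note on the commented-out line itself: it uses `cols & tile_size == 0`, and since `==` binds tighter than `&` in C++, that parses as `cols & (tile_size == 0)`. A sketch of the presumably intended divisibility check follows; `isTileAligned` and `asWritten` are hypothetical helpers, not names from ruy_interface.h.

```cpp
#include <cassert>
#include <cstddef>

// Presumably intended check: both dimensions are whole multiples of the tile size.
constexpr bool isTileAligned(std::size_t rows, std::size_t cols, std::size_t tile_size) {
  return tile_size != 0 && rows % tile_size == 0 && cols % tile_size == 0;
}

// The expression from the commented-out line, with the implicit grouping
// spelled out: `cols & tile_size == 0` means `cols & (tile_size == 0)`.
constexpr bool asWritten(std::size_t rows, std::size_t cols, std::size_t tile_size) {
  return rows % tile_size == 0 &&
         (cols & static_cast<std::size_t>(tile_size == 0)) != 0;
}

// For any non-zero tile size the second operand is `cols & 0`, i.e. false,
// so the literal expression rejects even perfectly aligned shapes.
static_assert(isTileAligned(16, 32, 8), "aligned shape should pass");
static_assert(!asWritten(16, 32, 8), "literal expression fails for aligned shape");
```

If the enabled assert elsewhere was copied from this line, it may be worth checking that it uses `%` on both sides.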

jerinphilip and others added 30 commits March 9, 2022 20:02

Fix sentencepiece submodule mixup

2ac7cbc

Merge branch 'browsermt-master' into arm-backend

93b841b

[sentencepiece] android cmake additional libs

9674973

Remove separately added patch in favour of submodule update

f3e7818

Remove trailing newline in integer_common.h to prettify diff

5250b9e

Remove trailing newline in ruy_adapter.h

26d3ba2

Merge branch 'browsermt-master' into arm-backend

b7969b0

In-place multiply without malloc by reinterpret_cast

b271b70

Documentation for the stdcpp/NEON paths created

efa5a85

Remove templated abort transpose()

179f239

Reinterpret at unquantize add bias as well as int32_t from float32_t

0d189c8

Remove AlignedVector from ruy_adapter - not required here.

8951261

Remove ViaRuy::PrepareBias without effect to output

49beb50

If SSE4.1 found use it to avoid perf regressions even if not -march=native

3cf85f7

Deduplicate multiply by capturing variability through callbacks

4edc8ef

Bit ugly for now, but we are headed towards better.

Revert "If SSE4.1 found use it to avoid perf regressions even if not -march=native"

a414b60

This reverts commit 3cf85f7.

CMAKE_SYSTEM_PROCESSOR indicates x86 and native mode is not enabled, apply -msse4.1.

e2069bf

Remove comments, now that callback is working

1b4049a

Minimal gemmRuy

e522e6c

There are no accompanying tests.

Update CI

90858a5

Style fixes: UnquantizeAndWrite, UnquantizeAddBiasAndWrite

d10009f

- Underscore suffix for curried args. - Make args private.

const for () operator overrides

3a37966

Explicit for single argument constructor: UnquantizeAndWrite

418a7ce

Fix typo

b7412c3

Low compute path for special case alpha = 1.0

071e0d4

Remove clang only pragmas

ec886bd

Remove leftover bias cycles comment

4df1998

Merge branch 'master' into arm-backend

1defce6

jerinphilip and others added 21 commits June 21, 2022 10:06

Switch to a {{0}} sigaction on WASM, {0} for rest

9027ea4

Revert "Restore -Werror"

a0ee527

This reverts commit 1b38e01.

Use -DFMA for NEON from simd_utils example

6285f28

Remove redundant neon_mathfun include after simd_utils.h

8895fda

Wrap CmakeLists.txt ARM definitions with an if

c6c3ac6

Use __clang__ instead of WASM_COMPATIBLE_SOURCE; emcc uses LLVM

3baf620

Suppress warnings by #pragma GCC diagnostic ...

aa1842c

Re-enable -Werror

8eae08b

{0} -> {} to work around empty-braces Werror

9a541c4

Replace -Wall with -Wcomment

4b80399

Revert "Replace -Wall with -Wcomment"

ac8de91

This reverts commit 4b80399.

Disable formatting then local edit -Wall -> -Wcomment

38b608a

Do not check for BLAS on usual ARM, except Mac: Apple Accelerate

86c8d44

Fix endif: CMakeScript quirks

861e31d

Towards simplifications

c82b628

WIP. fp32 works, albeit compilation flags are suboptimal

4233faf

Proper prepareB support

cee1c95

Towards simplifying ruy

34eea09

Simplify cmake and get on the way to simplification

e74cf4d

Independent ARM codepath

1dbc30c

Decouple RUY from intgemm

ab936f9

XapaJIaMnu requested review from kpu and graemenail September 27, 2022 14:20

kpu reviewed Oct 7, 2022

graemenail reviewed Oct 7, 2022

XapaJIaMnu added 4 commits January 17, 2023 14:45

remove outdated comment

10626d0

Merge branch 'master' into arm-backend-fix

193a407

Add an abort case

d033f06

update catch

348c283

XapaJIaMnu merged commit 4b30c26 into master Jan 18, 2023
