[UKernel] Improve peano zeroization ukernel #1107

jtuyls · 2025-02-14T16:08:58Z

No description provided.

jtuyls · 2025-02-14T16:19:13Z

compiler/plugins/target/AMD-AIE/iree-amd-aie/Target/mm_npu4_peano.cc

+  v16int32 zeros = broadcast_zero_to_v16int32();
+  for (unsigned i = offsetC / r; i < offsetC / r + M * N / r; i++) {
+    pC[i] = zeros;


Without the -fno-builtin-memset flag, we get the following assembly with scalar stores. The zeroing loop is recognized as a memset and converted into a libcall, which works on bytes because the alignment is not known (I got this info from the Peano folks):

00000000 <_Z15zero_vectorizedIDv16_iLi32ELi32ELi16EEvPT_j>:
       0: e1 00 00 70 00 00 00 02 00 5b 01 20 00 f0 2c 00   nopa ; nopb ; nops ; paddxm [sp], #0x40; nopv
      10: 7e 60 2b 00 08 00 00 00 00 20 00 00 41 00   mova r1, #0x2; nopb ; jl #0x0; nops
      1e: 2c 3b 00 00 01 f8   mova r1, #-0x40; lshl r0, r0, r1
      24: 98 14 00 10   and r0, r0, r1
      28: 02 70 10 00 00 b0 07 f8   st lr, [sp, #-64]; mov m0, r0
      30: ba 10 00 28 04 00 00 f0 0c 01   padda [p0], m0; movxm r1, #0x1000
      3a: d4 81 c1 02 00 00   mova r0, #0x0; mov p1, p0
      40: e1 00 00 00 00 00 00 00 00 5b 01 20 00 20 07 f8   lda lr, [sp, #-64]; nopb ; nops ; nopxm ; nopv
      50: 18 00 00 10   nopx
          ...
      5c: 00 00         nop
      5e: 18 00 28 10   ret lr
      62: 00 00         nop
      64: 00 00         nop
      66: 00 00         nop
      68: c4 01 00 00 f8 ff   paddxm [sp], #-0x40
      6e: 00 00         nop

With the flag, however, we see the vectorized store (vst):

00000000 <_Z15zero_vectorizedIDv16_iLi32ELi32ELi16EEvPT_j>:
       0: e1 00 00 10 00 78 00 00 00 5b 01 20 00 f0 2c 00   nopa ; nopb ; nops ; movxm ls, #0x0; nopv
      10: b6 10 00 b8 01 00 00 20 00 00 02 08   mova r2, #0x40; nopb ; movxm le, #0x0
      1c: 18 00 71 1d   add.nc lc, r2, #0x0
      20: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00   nopa ; nopb ; nops ; nopxm ; nopv
      30: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00   nopa ; nopb ; nops ; nopxm ; nopv
      40: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00   nopa ; nopb ; nops ; nopxm ; nopv
      50: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00   nopa ; nopb ; nops ; nopxm ; nopv
      60: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00   nopa ; nopb ; nops ; nopxm ; nopv
      70: e1 00 00 78 a5 01 08 10 00 5b 01 20 00 00 83 ff   mova r3, #-0x4; nopb ; nops ; movx r1, #0x0; nopm ; nopv
      80: e1 00 00 78 39 03 ec 01 00 5b 01 20 00 00 c2 00   mova r2, #0x6; nopb ; nops ; lshl r0, r0, r3; vbcst.32 x0, r1; nopv
00000090 <.LBB3_1>:
      90: 92 10 06 20 00 f0 2c 00   nopa ; nopb ; add r3, r0, r1
      98: 98 2d c6 10   lshl r3, r3, r2
      9c: f8 a0 81 18   mov dj0, r3
000000a0 <.L_LEnd0>:
      a0: e1 00 00 78 a5 01 38 10 02 13 00 20 00 f0 2c 00   nopa ; nopb ; vst x0, [p0, dj0]; add r1, r1, #0x1; nopm ; nopv
      b0: 2c 00 50 f0 2c 00   nopa ; ret lr
          ...
      be: 00 00         nop
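
For readers without the AIE toolchain, the loop behind both listings can be sketched in portable C++. This is only an approximation under stated assumptions: `Vec16i32` and `broadcast_zero` are hypothetical host-side stand-ins for `v16int32` and the broadcast-zero intrinsic, and the template signature is simplified from the one implied by the mangled symbol name. The loop body follows the diff hunk in this PR. A loop of this shape is the kind of pattern that compilers may rewrite into a `memset` call unless built with `-fno-builtin-memset`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for the AIE `v16int32` vector type:
// 16 lanes of int32, 64 bytes, 64-byte aligned.
struct alignas(64) Vec16i32 {
  std::int32_t lane[16];
};

// Hypothetical stand-in for the broadcast-zero intrinsic used in the PR.
inline Vec16i32 broadcast_zero() { return Vec16i32{}; }

// Zero an M x N tile of int32 that starts `offsetC` elements into the
// buffer, writing r = 16 elements per store.
template <int M, int N, int r = 16>
void zero_vectorized(Vec16i32 *__restrict pC, unsigned offsetC) {
  static_assert((M * N) % r == 0, "tile must be a whole number of vectors");
  assert(offsetC % r == 0 && "offset must be a multiple of the vector width");
  const Vec16i32 zeros = broadcast_zero();
  for (unsigned i = offsetC / r; i < offsetC / r + M * N / r; i++) {
    pC[i] = zeros;  // one full-vector store per iteration
  }
}
```

Calling `zero_vectorized<2, 16>(buf, 16)` on a four-vector buffer clears vectors 1 and 2 and leaves vectors 0 and 3 untouched.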

jtuyls · 2025-02-14T16:20:06Z

compiler/plugins/target/AMD-AIE/iree-amd-aie/Target/mm_npu4_peano.cc

@@ -25,6 +22,15 @@ void matmul_vectorized_i8_i32(const int8 * __restrict pA, unsigned offsetA, cons
  const unsigned size_B = L0_K * L0_N;
  const unsigned size_C = L0_M * L0_N;

+  v64int8 A0;


I have seen issues with loop pipelining when variables like these are declared inside the inner loops, so I declare them up front.
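
As an illustration of the pattern only, here is a minimal host-side sketch. `Vec64i8` and `dot_tiles` are hypothetical names, not code from this PR. On a host compiler the placement of the declarations should make no difference; the pipelining effect is reported for Peano only.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for the AIE `v64int8` vector type.
using Vec64i8 = std::array<std::int8_t, 64>;

// Dot product over `n` pairs of 64-lane int8 vectors, accumulated in int32.
std::int32_t dot_tiles(const Vec64i8 *a, const Vec64i8 *b, unsigned n) {
  // Declared once, before the loop nest, mirroring the hoisted `v64int8 A0;`.
  Vec64i8 A0;
  Vec64i8 B0;
  std::int32_t acc = 0;
  for (unsigned t = 0; t < n; ++t) {
    A0 = a[t];  // assignment only; no declaration inside the loop
    B0 = b[t];
    for (unsigned k = 0; k < 64; ++k) {
      acc += std::int32_t(A0[k]) * std::int32_t(B0[k]);
    }
  }
  return acc;
}
```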

Yu-Zhewen

LGTM, thanks!

newling · 2025-02-14T16:48:49Z

Nice find. I see vst generated in https://github.com/nod-ai/iree-amd-aie/pull/1095/files without this flag, maybe because the LLVMIR already has llvm.stores of the correct size (?).

jtuyls · 2025-02-14T16:57:55Z

> Nice find. I see vst generated in https://github.com/nod-ai/iree-amd-aie/pull/1095/files without this flag, maybe because the LLVMIR already has llvm.stores of the correct size (?).

Yeah, maybe. I need to check the LLVM IR (.ll) of the examples I posted above. There could be other reasons for Peano not recognizing the memset. In addition, if the loop extent is small enough, the vector stores are generated as well. For example:

void zero_vectorized(v16int32 *__restrict pC)
{
  v16int32 zeros = broadcast_zero_s32();
  for (unsigned i = 0; i < 8; i++) {
    pC[i] = zeros;
  }
}

compiles to (without -fno-builtin-memset):

00000000 <_Z15zero_vectorizedIDv16_iLi32ELi32ELi16EEvPT_j>:
      0: 2c 00 00 00 00 00     mova    r0, #0x0;               nopx
      6: f8 72 02 18   vbcst.32         x0, r0
      a: 00 00         nop
      c: 98 2a 1c 08   vst      wl0, [p0], #0x20
     10: 98 2a 1c 08   vst      wl0, [p0], #0x20
     14: 98 2a 1c 08   vst      wl0, [p0], #0x20
     18: 98 2a 1c 08   vst      wl0, [p0], #0x20
     1c: 98 2a 1c 08   vst      wl0, [p0], #0x20
     20: 98 2a 1c 08   vst      wl0, [p0], #0x20
     24: 98 2a 1c 08   vst      wl0, [p0], #0x20
     28: 98 2a 1c 08   vst      wl0, [p0], #0x20
     2c: 98 2a 1c 08   vst      wl0, [p0], #0x20
     30: 98 2a 1c 08   vst      wl0, [p0], #0x20
     34: 98 2a 1c 08   vst      wl0, [p0], #0x20
     38: 5c 00 50 50 85 03     vst      wl0, [p0], #0x20;              ret     lr
     3e: 98 2a 1c 08   vst      wl0, [p0], #0x20
     42: 98 2a 1c 08   vst      wl0, [p0], #0x20
     46: 98 2a 1c 08   vst      wl0, [p0], #0x20
     4a: 98 2a 04 08   vst      wl0, [p0, #0x0]
     4e: 00 00         nop
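
The number of stores in this listing can be checked with a little arithmetic. The sketch below assumes that each `vst wl0` writes 32 bytes, which is what the `#0x20` post-increment suggests, and that a `v16int32` occupies 64 bytes. The constant names are illustrative only.

```cpp
#include <cstddef>

// Loop extent from the example: 8 iterations, one v16int32 per iteration.
constexpr std::size_t kIterations = 8;
// 16 lanes x 4 bytes per int32 lane.
constexpr std::size_t kVectorBytes = 16 * sizeof(int);
// Assumption: one `vst wl0` writes 0x20 = 32 bytes.
constexpr std::size_t kStoreBytes = 0x20;

// Total bytes zeroed, and the number of stores needed to cover them.
constexpr std::size_t kTotalBytes = kIterations * kVectorBytes;
constexpr std::size_t kStores = kTotalBytes / kStoreBytes;

static_assert(sizeof(int) == 4, "sketch assumes 32-bit int");
static_assert(kTotalBytes == 512, "8 vectors of 64 bytes");
static_assert(kStores == 16, "matches the 16 vst instructions in the listing");
```

Under those assumptions, the fully unrolled loop needs 16 stores, which agrees with the 15 post-incrementing `vst` instructions plus the final `vst wl0, [p0, #0x0]`.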

[UKernel] Improve peano zeroization ukernel

b20be9d

jtuyls requested review from makslevental, nirvedhmeshram, newling, MaheshRavishankar, yzhang93 and Abhishek-Varma as code owners February 14, 2025 16:08

Yu-Zhewen approved these changes Feb 14, 2025

newling approved these changes Feb 14, 2025

jtuyls merged commit 14ccb7e into nod-ai:main Feb 14, 2025
7 checks passed

jtuyls deleted the improve-peano-zero-ukernel branch February 14, 2025 17:24
