[UKernel] Improve peano zeroization ukernel #1107

Merged
merged 1 commit into from
Feb 14, 2025

jtuyls
Collaborator

@jtuyls jtuyls commented Feb 14, 2025

No description provided.

Comment on lines +12 to +14
v16int32 zeros = broadcast_zero_to_v16int32();
for (unsigned i = offsetC / r; i < offsetC / r + M * N / r; i++) {
pC[i] = zeros;
Collaborator Author

Without the -fno-builtin-memset flag, the zeroing loop is recognized as a memset and converted into a libcall. The libcall works on bytes because the alignment is not known (I got this info from the peano folks), so we get the following assembly with scalar stores:

00000000 <_Z15zero_vectorizedIDv16_iLi32ELi32ELi16EEvPT_j>:
       0: e1 00 00 70 00 00 00 02 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               paddxm   [sp], #0x40;           nopv
      10: 7e 60 2b 00 08 00 00 00 00 20 00 00 41 00     mova    r1, #0x2;               nopb    ;               jl      #0x0;           nops
      1e: 2c 3b 00 00 01 f8     mova    r1, #-0x40;             lshl     r0, r0, r1
      24: 98 14 00 10   and      r0, r0, r1
      28: 02 70 10 00 00 b0 07 f8       st       lr, [sp, #-64];                mov     m0, r0
      30: ba 10 00 28 04 00 00 f0 0c 01 padda    [p0], m0;              movxm   r1, #0x1000
      3a: d4 81 c1 02 00 00     mova    r0, #0x0;               mov     p1, p0
      40: e1 00 00 00 00 00 00 00 00 5b 01 20 00 20 07 f8       lda      lr, [sp, #-64];                nopb    ;               nops    ;               nopxm   ;               nopv
      50: 18 00 00 10   nopx
                ...
      5c: 00 00         nop
      5e: 18 00 28 10   ret     lr
      62: 00 00         nop
      64: 00 00         nop
      66: 00 00         nop
      68: c4 01 00 00 f8 ff     paddxm   [sp], #-0x40
      6e: 00 00         nop

But with the flag, we see the vectorized store (vst):

00000000 <_Z15zero_vectorizedIDv16_iLi32ELi32ELi16EEvPT_j>:
       0: e1 00 00 10 00 78 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               movxm   ls, #0x0;               nopv
      10: b6 10 00 b8 01 00 00 20 00 00 02 08   mova    r2, #0x40;              nopb    ;               movxm   le, #0x0
      1c: 18 00 71 1d   add.nc  lc, r2, #0x0
      20: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopxm   ;               nopv
      30: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopxm   ;               nopv
      40: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopxm   ;               nopv
      50: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopxm   ;               nopv
      60: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopxm   ;               nopv
      70: e1 00 00 78 a5 01 08 10 00 5b 01 20 00 00 83 ff       mova    r3, #-0x4;              nopb    ;               nops    ;               movx    r1, #0x0;               nopm    ;               nopv
      80: e1 00 00 78 39 03 ec 01 00 5b 01 20 00 00 c2 00       mova    r2, #0x6;               nopb    ;               nops    ;               lshl     r0, r0, r3;            vbcst.32         x0, r1;                nopv

00000090 <.LBB3_1>:
      90: 92 10 06 20 00 f0 2c 00       nopa    ;               nopb    ;               add      r3, r0, r1
      98: 98 2d c6 10   lshl     r3, r3, r2
      9c: f8 a0 81 18   mov     dj0, r3

000000a0 <.L_LEnd0>:
      a0: e1 00 00 78 a5 01 38 10 02 13 00 20 00 f0 2c 00       nopa    ;               nopb    ;               vst      x0, [p0, dj0];         add     r1, r1, #0x1;           nopm    ;               nopv
      b0: 2c 00 50 f0 2c 00     nopa    ;               ret     lr
                ...
      be: 00 00         nop
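
As a host-side illustration of why the loop can be treated as a memset: storing an all-zero vector to consecutive elements writes exactly the same bytes as one memset over the whole range. The sketch below is an assumption-laden stand-in, not the ukernel itself. `Vec16i32` is a hypothetical replacement for `v16int32`, and plain C++ is used instead of the AIE intrinsics, so it only demonstrates the equivalence, not what peano generates.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for v16int32: 16 lanes of int32, 64 bytes in total.
struct Vec16i32 {
  int32_t lanes[16];
};

// Same shape as the ukernel loop: one whole-vector store per iteration.
void zero_with_loop(Vec16i32 *pC, unsigned count) {
  const Vec16i32 zeros = {};
  for (unsigned i = 0; i < count; i++) {
    pC[i] = zeros;
  }
}

// What the loop is semantically equivalent to: a single byte-wise memset.
void zero_with_memset(Vec16i32 *pC, unsigned count) {
  std::memset(pC, 0, count * sizeof(Vec16i32));
}

// Returns true when both variants leave identical bytes behind.
// Intended for count > 0.
bool same_result(unsigned count) {
  std::vector<Vec16i32> a(count), b(count);
  // Fill both buffers with a non-zero pattern so the zeroing is observable.
  std::memset(a.data(), 0xAB, count * sizeof(Vec16i32));
  std::memset(b.data(), 0xAB, count * sizeof(Vec16i32));
  zero_with_loop(a.data(), count);
  zero_with_memset(b.data(), count);
  return std::memcmp(a.data(), b.data(), count * sizeof(Vec16i32)) == 0;
}
```

Because the two functions are interchangeable in effect, a compiler is free to pick either form. The flag discussed above is what keeps the explicit vector stores.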

@@ -25,6 +22,15 @@ void matmul_vectorized_i8_i32(const int8 * __restrict pA, unsigned offsetA, cons
const unsigned size_B = L0_K * L0_N;
const unsigned size_C = L0_M * L0_N;

v64int8 A0;
Collaborator Author

I have seen issues with loop pipelining if these are defined in the inner loops.
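
A minimal host-side sketch of that code shape, assuming a hypothetical `Vec4` stand-in type rather than `v64int8` or any other AIE vector type: the temporaries are declared once before the loop nest and only assigned inside it. Whether this helps the pipeliner is specific to peano. The sketch only shows where the declarations go.

```cpp
#include <cstddef>

// Hypothetical stand-in for a vector register type.
struct Vec4 {
  int lanes[4];
};

// Sums the lane-wise products of corresponding rows of a and b.
int dot_rows(const Vec4 *a, const Vec4 *b, std::size_t rows) {
  // Temporaries declared outside the loops, mirroring the hoisted `v64int8 A0;`.
  Vec4 A0;
  Vec4 B0;
  int acc = 0;
  for (std::size_t r = 0; r < rows; r++) {
    A0 = a[r];  // only assigned inside the loop, not declared here
    B0 = b[r];
    for (int l = 0; l < 4; l++) {
      acc += A0.lanes[l] * B0.lanes[l];
    }
  }
  return acc;
}
```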

Contributor

@Yu-Zhewen Yu-Zhewen left a comment

LGTM, thanks!

@newling
Contributor

newling commented Feb 14, 2025

Nice find. I see vst generated in https://github.com/nod-ai/iree-amd-aie/pull/1095/files without this flag, maybe because the LLVMIR already has llvm.stores of the correct size (?).

@jtuyls
Collaborator Author

jtuyls commented Feb 14, 2025

> Nice find. I see vst generated in https://github.com/nod-ai/iree-amd-aie/pull/1095/files without this flag, maybe because the LLVMIR already has llvm.stores of the correct size (?).

Yeah, maybe. I need to check the LLVM IR (.ll) of the examples I have above. There could be other reasons for peano not recognizing the memset. Also, if the loop extent is small enough, the vector stores are generated even without the flag. For example:

void zero_vectorized(v16int32 *__restrict pC)
{
  v16int32 zeros = broadcast_zero_s32();
  for (unsigned i = 0; i < 8; i++) {
    pC[i] = zeros;
  }
}

compiles to (without -fno-builtin-memset):

00000000 <_Z15zero_vectorizedIDv16_iLi32ELi32ELi16EEvPT_j>:
      0: 2c 00 00 00 00 00     mova    r0, #0x0;               nopx
      6: f8 72 02 18   vbcst.32         x0, r0
      a: 00 00         nop
      c: 98 2a 1c 08   vst      wl0, [p0], #0x20
     10: 98 2a 1c 08   vst      wl0, [p0], #0x20
     14: 98 2a 1c 08   vst      wl0, [p0], #0x20
     18: 98 2a 1c 08   vst      wl0, [p0], #0x20
     1c: 98 2a 1c 08   vst      wl0, [p0], #0x20
     20: 98 2a 1c 08   vst      wl0, [p0], #0x20
     24: 98 2a 1c 08   vst      wl0, [p0], #0x20
     28: 98 2a 1c 08   vst      wl0, [p0], #0x20
     2c: 98 2a 1c 08   vst      wl0, [p0], #0x20
     30: 98 2a 1c 08   vst      wl0, [p0], #0x20
     34: 98 2a 1c 08   vst      wl0, [p0], #0x20
     38: 5c 00 50 50 85 03     vst      wl0, [p0], #0x20;              ret     lr
     3e: 98 2a 1c 08   vst      wl0, [p0], #0x20
     42: 98 2a 1c 08   vst      wl0, [p0], #0x20
     46: 98 2a 1c 08   vst      wl0, [p0], #0x20
     4a: 98 2a 04 08   vst      wl0, [p0, #0x0]
     4e: 00 00         nop

@jtuyls jtuyls merged commit 14ccb7e into nod-ai:main Feb 14, 2025
7 checks passed
@jtuyls jtuyls deleted the improve-peano-zero-ukernel branch February 14, 2025 17:24