[UKernel] Improve peano zeroization ukernel #1107
v16int32 zeros = broadcast_zero_to_v16int32();
for (unsigned i = offsetC / r; i < offsetC / r + M * N / r; i++) {
  pC[i] = zeros;
Without the -fno-builtin-memset flag, the zeroing loop is recognized as a memset and converted into a libcall. Because the alignment is not known, the libcall works on bytes, so we get the following assembly with scalar stores (I got this info from the peano folks):
00000000 <_Z15zero_vectorizedIDv16_iLi32ELi32ELi16EEvPT_j>:
0: e1 00 00 70 00 00 00 02 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; paddxm [sp], #0x40; nopv
10: 7e 60 2b 00 08 00 00 00 00 20 00 00 41 00 mova r1, #0x2; nopb ; jl #0x0; nops
1e: 2c 3b 00 00 01 f8 mova r1, #-0x40; lshl r0, r0, r1
24: 98 14 00 10 and r0, r0, r1
28: 02 70 10 00 00 b0 07 f8 st lr, [sp, #-64]; mov m0, r0
30: ba 10 00 28 04 00 00 f0 0c 01 padda [p0], m0; movxm r1, #0x1000
3a: d4 81 c1 02 00 00 mova r0, #0x0; mov p1, p0
40: e1 00 00 00 00 00 00 00 00 5b 01 20 00 20 07 f8 lda lr, [sp, #-64]; nopb ; nops ; nopxm ; nopv
50: 18 00 00 10 nopx
...
5c: 00 00 nop
5e: 18 00 28 10 ret lr
62: 00 00 nop
64: 00 00 nop
66: 00 00 nop
68: c4 01 00 00 f8 ff paddxm [sp], #-0x40
6e: 00 00 nop
But with the flag, we see the vectorized store (vst):
00000000 <_Z15zero_vectorizedIDv16_iLi32ELi32ELi16EEvPT_j>:
0: e1 00 00 10 00 78 00 00 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; movxm ls, #0x0; nopv
10: b6 10 00 b8 01 00 00 20 00 00 02 08 mova r2, #0x40; nopb ; movxm le, #0x0
1c: 18 00 71 1d add.nc lc, r2, #0x0
20: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; nopxm ; nopv
30: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; nopxm ; nopv
40: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; nopxm ; nopv
50: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; nopxm ; nopv
60: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; nopxm ; nopv
70: e1 00 00 78 a5 01 08 10 00 5b 01 20 00 00 83 ff mova r3, #-0x4; nopb ; nops ; movx r1, #0x0; nopm ; nopv
80: e1 00 00 78 39 03 ec 01 00 5b 01 20 00 00 c2 00 mova r2, #0x6; nopb ; nops ; lshl r0, r0, r3; vbcst.32 x0, r1; nopv
00000090 <.LBB3_1>:
90: 92 10 06 20 00 f0 2c 00 nopa ; nopb ; add r3, r0, r1
98: 98 2d c6 10 lshl r3, r3, r2
9c: f8 a0 81 18 mov dj0, r3
000000a0 <.L_LEnd0>:
a0: e1 00 00 78 a5 01 38 10 02 13 00 20 00 f0 2c 00 nopa ; nopb ; vst x0, [p0, dj0]; add r1, r1, #0x1; nopm ; nopv
b0: 2c 00 50 f0 2c 00 nopa ; ret lr
...
be: 00 00 nop
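For readers who want to reproduce the comparison off-device, here is a minimal host-side sketch of the same loop shape. It is not the ukernel from this PR: `Vec16i32` and `zero_tile` are hypothetical stand-ins for `v16int32` and the zeroing template, the value-initialized `zeros` replaces the `broadcast_zero_to_v16int32()` intrinsic, and the alignment assertion is an added assumption. The mangled symbol in the listings appears to decode to the instantiation with a 16-lane int vector and template arguments 32, 32, 16, which would give 32 * 32 / 16 = 64 iterations, consistent with the `#0x40` loaded before the loop in the second listing.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for the AIE v16int32 type:
// 16 lanes of int32, 64 bytes, 64-byte aligned.
struct alignas(64) Vec16i32 {
  std::int32_t lanes[16];
};

// Same loop shape as the zeroing loop under review: zero M * N elements of C,
// r lanes per store, starting at element offset offsetC.
template <typename V, int M, int N, int r>
void zero_tile(V *pC, unsigned offsetC) {
  // Assumption made explicit for this sketch: the offset is a whole number of
  // vectors, otherwise offsetC / r silently rounds down.
  assert(offsetC % r == 0);
  V zeros{};  // all lanes zero; the real kernel uses a broadcast-zero intrinsic
  for (unsigned i = offsetC / r; i < offsetC / r + M * N / r; i++) {
    pC[i] = zeros;  // a compiler may turn this loop into a memset call
  }
}
```

Compiling a translation unit like this with and without `-fno-builtin-memset` and diffing the disassembly is one way to check whether the loop is being rewritten into a memset libcall on a given toolchain; whether that happens depends on the compiler and target.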
@@ -25,6 +22,15 @@ void matmul_vectorized_i8_i32(const int8 * __restrict pA, unsigned offsetA, cons
const unsigned size_B = L0_K * L0_N;
const unsigned size_C = L0_M * L0_N;

v64int8 A0;
I have seen issues with loop pipelining if these are defined in the inner loops.
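As a rough illustration of that pattern, the sketch below declares the vector temporaries once at function scope and only assigns them inside the loop. It is plain host C++, not the matmul ukernel: `Vec64i8`, `dot64`, and `sum_of_dots` are hypothetical names, and whether hoisting actually helps pipelining depends on the compiler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for the AIE v64int8 type.
struct Vec64i8 {
  std::int8_t lanes[64];
};

// Hypothetical helper: widening dot product of two 64-lane int8 vectors.
std::int32_t dot64(const Vec64i8 &a, const Vec64i8 &b) {
  std::int32_t sum = 0;
  for (int i = 0; i < 64; ++i) {
    sum += static_cast<std::int32_t>(a.lanes[i]) * b.lanes[i];
  }
  return sum;
}

// Accumulates dot products over `tiles` pairs of vectors.
std::int32_t sum_of_dots(const Vec64i8 *pA, const Vec64i8 *pB, unsigned tiles) {
  assert(pA != nullptr && pB != nullptr);
  // Declared once, outside the loop, mirroring the `v64int8 A0;` placement in
  // the diff; the loop body only assigns to them.
  Vec64i8 A0;
  Vec64i8 B0;
  std::int32_t acc = 0;
  for (unsigned t = 0; t < tiles; ++t) {
    A0 = pA[t];
    B0 = pB[t];
    acc += dot64(A0, B0);
  }
  return acc;
}
```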
LGTM, thanks!
Nice find. I see vst generated in https://github.com/nod-ai/iree-amd-aie/pull/1095/files without this flag, maybe because the LLVM IR already has llvm.stores of the correct size (?).
Yeah, maybe. I need to check the .ll of the examples I posted above. There could be other reasons for peano not recognizing the memset. Also, if the loop extent is small enough, the vector stores are generated as well, for example:
compiles to (without