Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Print program memory usage (PM) along with kernel time #953

Merged
merged 6 commits into from
Dec 5, 2024

Conversation

newling
Copy link
Contributor

@newling newling commented Dec 3, 2024

This will be useful to see how close we are to overflowing program memory (16384 B). In the case where we do overflow, we won't see how much PM we require. For that, we'll need to add a check in aie2xclbin -- future work.

UPDATE: the size if the .elf file isn't what we want, the elf file contains other stuff. I discovered that the program memory size is stored at byte 72 of the elf (see newling/aie-rt@d0f08bc for how I found this)

Copy link
Collaborator

@jtuyls jtuyls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Great to keep track of this as well!

@newling
Copy link
Contributor Author

newling commented Dec 3, 2024

Something isn't as expected, the printed ranges for elf sizes are too large, https://github.com/nod-ai/iree-amd-aie/actions/runs/12147431541/job/33875335756

32 KB and still working?

@jtuyls
Copy link
Collaborator

jtuyls commented Dec 3, 2024

Something isn't as expected, the printed ranges for elf sizes are too large, https://github.com/nod-ai/iree-amd-aie/actions/runs/12147431541/job/33875335756

32 KB and still working?

Yeah, those sizes don't look correct.

@yzhang93
Copy link
Contributor

yzhang93 commented Dec 3, 2024

Something isn't as expected, the printed ranges for elf sizes are too large, https://github.com/nod-ai/iree-amd-aie/actions/runs/12147431541/job/33875335756

32 KB and still working?

So here the assumption is the elf file size exactly equals to program memory usage? Not sure if it has any other contents which makes it larger than program memory.

@newling
Copy link
Contributor Author

newling commented Dec 3, 2024

Something isn't as expected, the printed ranges for elf sizes are too large, https://github.com/nod-ai/iree-amd-aie/actions/runs/12147431541/job/33875335756
32 KB and still working?

So here the assumption is the elf file size exactly equals to program memory usage? Not sure if it has any other contents which makes it larger than program memory.

Yeah that was my assumption, and it's incorrect. Ultimately we want the number checked here, where the error is thrown we go OOM: https://github.com/Xilinx/aie-rt/blob/3fdf8d8d9a7a2abcd8df76155f950680607eac8a/driver/src/core/xaie_elfloader.c#L229

but I don't think there's a simple way to get that. The call stack down to there looks like

// in the iree-amd-aie side:
aie2xclbin ->
generateCDO ->
AIETranslateToCDODirect ->
generateCDOBinariesSeparately ->
addAieElfsToCDO ->
addElfToTile ->

// on the aie_rt side:
XAie_LoadElf ->
XAie_LoadElfPartial ->
_XAie_LoadElfFromMem ->
_XAie_WriteProgramSection ->
_XAie_LoadProgMemSection

Those last 2 calls are slicing into the elf C object using pointers which I don't think we can reasonably access. That is on the side where the elf is used to create the CDO. Before that there's a step for creating the elf (see aie2xclbin function), but I don't see anything in there either which will give us the magic number.

@yzhang93
Copy link
Contributor

yzhang93 commented Dec 4, 2024

Yeah that was my assumption, and it's incorrect. Ultimately we want the number checked here, where the error is thrown we go OOM: https://github.com/Xilinx/aie-rt/blob/3fdf8d8d9a7a2abcd8df76155f950680607eac8a/driver/src/core/xaie_elfloader.c#L229

but I don't think there's a simple way to get that. The call stack down to there looks like

// in the iree-amd-aie side: aie2xclbin -> generateCDO -> AIETranslateToCDODirect -> generateCDOBinariesSeparately -> addAieElfsToCDO -> addElfToTile ->

// on the aie_rt side: XAie_LoadElf -> XAie_LoadElfPartial -> _XAie_LoadElfFromMem -> _XAie_WriteProgramSection -> _XAie_LoadProgMemSection

Those last 2 calls are slicing into the elf C object using pointers which I don't think we can reasonably access. That is on the side where the elf is used to create the CDO. Before that there's a step for creating the elf (see aie2xclbin function), but I don't see anything in there either which will give us the magic number.

I see. So if this number doesn't accurately show the PM usage, I think it's not very useful to print it out.

@newling
Copy link
Contributor Author

newling commented Dec 4, 2024

I see. So if this number doesn't accurately show the PM usage, I think it's not very useful to print it out.

Yup. I'm going to do some experiments to see if it as a good indicator of PM though, and if it is then it might be useful.

@newling newling force-pushed the print_program_memory_in_ci branch from 4010452 to 2641d30 Compare December 5, 2024 00:24
@newling newling requested review from jtuyls and yzhang93 December 5, 2024 00:28
@newling
Copy link
Contributor Author

newling commented Dec 5, 2024

@jtuyls @yzhang93 the value printed is exactly the program memory, turns out the program memory size is stored at a fixed offset in the elf files. Please re-review because this if you like

@jtuyls
Copy link
Collaborator

jtuyls commented Dec 5, 2024

@jtuyls @yzhang93 the value printed is exactly the program memory, turns out the program memory size is stored at a fixed offset in the elf files. Please re-review because this if you like

Thanks, looks good to me. We could consider throwing errors to catch it when this changes, but it won't be a 100% guarantee. We could check for example that the value is in range ]0, 16KB] or even [1KB, 16KB] or so.

@newling
Copy link
Contributor Author

newling commented Dec 5, 2024

@jtuyls @yzhang93 the value printed is exactly the program memory, turns out the program memory size is stored at a fixed offset in the elf files. Please re-review because this if you like

Thanks, looks good to me. We could consider throwing errors to catch it when this changes, but it won't be a 100% guarantee. We could check for example that the value is in range ]0, 16KB] or even [1KB, 16KB] or so.

Makes sense, I'll add a check for <= 16384 bytes. I realized that my comment in the description that we'll see this number even when we go OOM is wrong. For that, we'd need to analyze the elf in aie2xclbin. Follow-up work...

@newling
Copy link
Contributor Author

newling commented Dec 5, 2024

or even [1KB, 16KB]

For the tiny matmul_truncf_16_16_bf16_f32 test the memory is only 768 bytes!

@newling newling enabled auto-merge (squash) December 5, 2024 18:00
@newling newling merged commit 9469cae into nod-ai:main Dec 5, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants