
Arm backend: Remove fast scratch part for now #10958


Merged: 2 commits merged into main from export-D74939323 on May 19, 2025

Conversation

kirklandsign
Contributor

Summary: Fix CI

Differential Revision: D74939323

@kirklandsign kirklandsign requested a review from digantdesai as a code owner May 17, 2025 18:35

pytorch-bot bot commented May 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10958


❌ 1 New Failure

As of commit 116eaf2 with merge base 9aaea31:

NEW FAILURE - The following job has failed:


@facebook-github-bot facebook-github-bot added the CLA Signed label May 17, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D74939323

@zingo
Collaborator

zingo commented May 17, 2025

Hi, sorry for the breakage. From our (Arm) point of view, just do whatever is easier for you to make it work. Either revert or apply this fix, and we will look into this when we are back at work next week.

To better understand the problem you are seeing: if you know, is it the case that you are testing with a different runner than our example runner?

Collaborator

@zingo zingo left a comment


OK, just to make it work for you.

@kirklandsign
Contributor Author

> To better understand the problem you are seeing: if you know, is it the case that you are testing with a different runner than our example runner?

Seems that our internal CI has a different runner than the example runner.

@facebook-github-bot
Contributor

@kirklandsign has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@zingo
Collaborator

zingo commented May 17, 2025

> Seems that our internal CI has a different runner than the example runner.

Thanks, that makes sense given that you are hitting this problem, and it is something we will try to keep in mind.

@zingo
Collaborator

zingo commented May 18, 2025

Regarding the failed Arm unit model test: it is unrelated, and a fix for it is (hopefully) here: #10953. It is just waiting for review.

@zingo zingo added the ciflow/trunk and module: arm labels May 18, 2025
@zingo zingo changed the title Remove fast scratch part for now Arm backend: Remove fast scratch part for now May 18, 2025
@facebook-github-bot facebook-github-bot merged commit 7d9dd46 into main May 19, 2025
434 of 440 checks passed
@facebook-github-bot facebook-github-bot deleted the export-D74939323 branch May 19, 2025 00:33
@kirklandsign
Contributor Author

Hi @zingo, it seems that the test-arm-backend (test_models_ethos-u85) / linux-job consistently fails after this patch.

https://github.com/pytorch/executorch/actions/runs/15101696584/job/42443518087

Could you please suggest how to fix it?

@gggekov
Collaborator

gggekov commented May 19, 2025

Hi @kirklandsign,
Yes, when you pass nullptr and 0 as the base address / base address size for the U85 in Dedicated_Sram memory mode, the inference hangs, as you see in your CI trace; that behaviour is expected. Dedicated_Sram means that the NPU uses the fast scratch buffer to store the most commonly accessed intermediate tensors. With your fix, you are still testing Dedicated_Sram on the U85, but you are not providing a fast scratch array, hence the NPU hangs.

Can you please expand a bit more on what you mean when you say you use a different arm_executor_runner.cpp in your internal test suite: do you have a way to define ethosu_fast_scratch/ethosu_fast_scratch_size to be 0 or 384 KB depending on whether you test the U55 (Shared_Sram) or the U85 (Dedicated_Sram) in your internal arm_executor_runner.cpp? (See the sketch after this comment.)

The U85 is designed to enable NNs where the scratch buffer is too big to fit into the SRAM, so we place it in the DDR. To achieve good performance, it is important to still "dedicate" a small amount of SRAM where the NPU can read/write the most commonly accessed tensors of the NN. That behaviour is enabled by the Dedicated_Sram memory mode, so it is important that we support it properly.
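For illustration, a minimal sketch of the conditional definition the question above asks about, assuming a hypothetical CMake-defined macro MEMORY_MODE_DEDICATED_SRAM selects the memory mode; the ethosu_fast_scratch/ethosu_fast_scratch_size names and the 384 KB figure come from this thread, but the actual code in an internal runner may differ:

```cpp
#include <cstddef>
#include <cstdint>

#if defined(MEMORY_MODE_DEDICATED_SRAM)
// Dedicated_Sram (e.g. Ethos-U85): carve out SRAM that the NPU uses as a
// cache for the most commonly accessed intermediate tensors.
namespace {
uint8_t fast_scratch_storage[384 * 1024] __attribute__((aligned(16)));
} // namespace
uint8_t* ethosu_fast_scratch = fast_scratch_storage;
size_t ethosu_fast_scratch_size = sizeof(fast_scratch_storage);
#else
// Shared_Sram / Sram_Only (e.g. Ethos-U55): no fast scratch; passing
// nullptr/0 here is what the backend must tolerate instead of hanging.
uint8_t* ethosu_fast_scratch = nullptr;
size_t ethosu_fast_scratch_size = 0;
#endif
```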

gggekov added a commit to gggekov/executorch that referenced this pull request May 19, 2025
Temporary solution to the problem in pytorch#10958. The arm_executor_runner.cpp needs to declare the ethosu_fast_scratch array and pass it on to EthosUBackend.cpp. It is important that for Shared_Sram, ethosu_fast_scratch is nullptr, and for Dedicated_Sram it points to the fast memory array.

Change-Id: I808203fb7b9b6e5bece92c4cc5079f22bd802d95
@hsharma35
Contributor

@kirklandsign @gggekov Can these variables be added to the compile spec instead of declaring them in arm_executor_runner.cpp?

@kirklandsign
Contributor Author

@zingo @hsharma35 Sorry, I am not the best contact for this issue. I don't work on this area and just ran into an internal error. We could use @digantdesai's help.

zingo pushed a commit that referenced this pull request May 19, 2025
…Ethos-U85 (#10973)

Temporary solution to the problem in #10958. The arm_executor_runner.cpp needs to declare the ethosu_fast_scratch array and pass it on to EthosUBackend.cpp. It is important that for Shared_Sram, ethosu_fast_scratch is nullptr, and for Dedicated_Sram it points to the fast memory array.
hinriksnaer pushed a commit to hinriksnaer/executorch that referenced this pull request May 19, 2025
Differential Revision: D74939323

Pull Request resolved: pytorch#10958
hinriksnaer pushed a commit to hinriksnaer/executorch that referenced this pull request May 19, 2025
…Ethos-U85 (pytorch#10973)

Temporary solution to the problem in pytorch#10958. The arm_executor_runner.cpp needs to declare the ethosu_fast_scratch array and pass it on to EthosUBackend.cpp. It is important that for Shared_Sram, ethosu_fast_scratch is nullptr, and for Dedicated_Sram it points to the fast memory array.
@gggekov
Collaborator

gggekov commented May 21, 2025

Hi @hsharma35,

The EthosUBackend.cpp currently doesn't know whether the NN has been compiled for Shared_Sram, Dedicated_Sram, or another memory mode, but the arm_executor_runner.cpp knows that, thanks to the propagation of parameters in executorch/examples/arm/executor_runner/CMakeLists.txt. For a PTE generated for Shared_Sram or Sram_Only, we only need to pass two base pointers / base pointer sizes, but for Dedicated_Sram we need to pass a third base pointer towards the fast scratch array. Hence, I declared the fast scratch in the arm_executor_runner depending on the memory mode, and then passed the fast scratch (nullptr or a valid address) as an extern resolved at link time in EthosUBackend.cpp, as sketched below.
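As a rough illustration of that link-time coupling (not the exact code from the PR), the backend side could declare the runner-provided symbols like this; the trailing comment shows, schematically rather than with the real API, how the third base pointer would be filled in:

```cpp
#include <cstddef>
#include <cstdint>

// Defined by the runner (e.g. arm_executor_runner.cpp) and resolved when the
// final image is linked; the backend only declares them. This is exactly the
// coupling discussed in this thread: every runner linked against the backend
// now has to know about these symbols.
extern uint8_t* ethosu_fast_scratch;
extern size_t ethosu_fast_scratch_size;

// Schematic use for a Dedicated_Sram PTE (illustrative only):
//   bases[2] = reinterpret_cast<uintptr_t>(ethosu_fast_scratch);
//   sizes[2] = ethosu_fast_scratch_size;
```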

What's the best way to enable Dedicated_Sram in the runtime and make your internal CI pass?

CC @digantdesai @kirklandsign

@digantdesai
Contributor

> Can these variables be added to the compile spec instead of declaring them in arm_executor_runner.cpp?

@hsharma35 I am not sure compile_spec is the way to do this, especially if the delegate runtime wants to know the pointer at runtime based on the runner setup.

We have a flag for specifying the memory_mode at PTE generation time, the same one used by the CMake build, IIUC. But the scratch pointer and size information originate at runtime in the runner (through user-specified CMake knobs), and we can't easily pass them into delegate::init() from a runner.

Here @gggekov used externs, but I am not sure that's the right approach, because it leads to precisely these issues where a runner is now coupled to a delegate, even though they shouldn't know about each other, given that they sit in different abstraction layers.

We are working on et::backend_configs, which may help with this, but it is not ready yet. #10216

Let me think some more. For now, can we guard the delegate runtime variables so that they are backwards compatible with a runner that doesn't care about this memory_mode? @gggekov
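A minimal sketch of such a guard, reusing the extern symbols from the earlier sketch; the helper name is made up for illustration:

```cpp
#include <cstddef>
#include <cstdint>

extern uint8_t* ethosu_fast_scratch;     // provided by the runner
extern size_t ethosu_fast_scratch_size;  // provided by the runner

// Hypothetical guard: only pass the third base pointer when the runner has
// actually provided a fast scratch region, so a runner that ignores
// memory_mode keeps the old two-pointer behaviour. (Link-time presence of
// the symbols still has to be solved, e.g. via the weak symbols discussed
// below.)
inline bool has_dedicated_sram_scratch() {
  return ethosu_fast_scratch != nullptr && ethosu_fast_scratch_size != 0;
}
```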

@gggekov
Collaborator

gggekov commented May 27, 2025

Thanks @digantdesai, you are right that the extern couples the arm_executor_runner.cpp and EthosUBackend.cpp in a way that we should avoid.

How about we use a weak symbol for the fast scratch array, set to nullptr by default in EthosUBackend.cpp? That corresponds to Shared_Sram for the U55, U65, and U85. In the case of a U85 SoC where we want to use the DRAM (Dedicated_Sram), we override the weak symbol in arm_executor_runner.cpp with the correct array for the cache. I believe this should pass your internal CI, and it is close to a real system where, for Dedicated_Sram, the user needs to carve out a specified amount of memory for the NPU.
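A minimal two-file sketch of this weak-symbol scheme, assuming GCC/Clang __attribute__((weak)) syntax; the symbol names follow the thread and the 384 KB carve-out is the size mentioned earlier:

```cpp
#include <cstddef>
#include <cstdint>

// file: EthosUBackend.cpp -- weak defaults.
// nullptr/0 corresponds to Shared_Sram / Sram_Only; a runner that defines
// nothing still links and behaves exactly as before.
__attribute__((weak)) uint8_t* ethosu_fast_scratch = nullptr;
__attribute__((weak)) size_t ethosu_fast_scratch_size = 0;

// file: arm_executor_runner.cpp -- Dedicated_Sram builds only.
// Strong definitions override the weak defaults at link time, carving out
// the SRAM that the NPU uses as its cache.
namespace {
uint8_t fast_scratch_storage[384 * 1024] __attribute__((aligned(16)));
} // namespace
uint8_t* ethosu_fast_scratch = fast_scratch_storage;
size_t ethosu_fast_scratch_size = sizeof(fast_scratch_storage);
```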

Regarding #10216: in principle it may be useful, but I don't think we need a backend-specific configuration to enable the Dedicated_Sram mode on the U85. Also, note that in executorch/backends/arm/arm_vela.py, the Vela npz file contains the size of the scratch buffer and of the fast scratch buffer. For Shared_Sram/Sram_Only, the fast scratch buffer size is 0, and for Dedicated_Sram it is equal to the amount of SRAM that the NPU can utilise.

@digantdesai
Contributor

I think the weak symbol override is fine, as long as we don't tightly couple the delegate runtime and the app runner.

> I don't think we need a backend-specific configuration to enable the Dedicated_Sram mode on the U85

Generally speaking, I see this as a cleaner mechanism for passing information from the outside into a delegate, especially when it originates at runtime. For compile-time information we can use other mechanisms like compile_spec, preprocessor macros/variables, etc.

@gggekov
Collaborator

gggekov commented Jun 6, 2025

FYI, here is a PR (#11459) with the suggested weak-symbol fix.

Labels: ciflow/trunk, CLA Signed, fb-exported, module: arm, release notes: none, topic: not user facing