
Arm backend: Remove fast scratch part for now #10958


Merged: 2 commits merged into main from export-D74939323 on May 19, 2025

Conversation

kirklandsign
Contributor

Summary: Fix CI

Differential Revision: D74939323

@kirklandsign kirklandsign requested a review from digantdesai as a code owner May 17, 2025 18:35

pytorch-bot bot commented May 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10958


❌ 1 New Failure

As of commit 116eaf2 with merge base 9aaea31:

NEW FAILURE - The following job has failed:


@facebook-github-bot facebook-github-bot added the CLA Signed label May 17, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D74939323

@zingo
Collaborator

zingo commented May 17, 2025

Hi, sorry for the breakage. From our (Arm) point of view, just do whatever is easier for you to make it work. Either revert or apply this fix, and we will look into this when we are back at work next week.

To better understand the problem you are seeing: if you know, is it the case that you are testing with a different runner than our example runner?

Collaborator

@zingo zingo left a comment


OK, just to make it work for you.

@kirklandsign
Contributor Author

> To better understand the problem you are seeing: if you know, is it the case that you are testing with a different runner than our example runner?

Seems that our internal CI has a different runner than the example runner.

@facebook-github-bot
Contributor

@kirklandsign has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@zingo
Collaborator

zingo commented May 17, 2025

> Seems that our internal CI has a different runner than the example runner.

Thanks, that makes sense given that you are hitting this problem, and it is something we will try to keep in mind.

@zingo
Collaborator

zingo commented May 18, 2025

Regarding the failed Arm unit model test: it is unrelated, and a fix for it is (hopefully) here: #10953. It is just waiting for review.

@zingo zingo added the ciflow/trunk and module: arm labels May 18, 2025
@zingo zingo changed the title Remove fast scratch part for now Arm backend: Remove fast scratch part for now May 18, 2025
@facebook-github-bot facebook-github-bot merged commit 7d9dd46 into main May 19, 2025
434 of 440 checks passed
@facebook-github-bot facebook-github-bot deleted the export-D74939323 branch May 19, 2025 00:33
@kirklandsign
Contributor Author

Hi @zingo, it seems that the test-arm-backend (test_models_ethos-u85) / linux-job consistently fails after this patch.

https://github.com/pytorch/executorch/actions/runs/15101696584/job/42443518087

Could you please suggest how to fix it?

@gggekov
Collaborator

gggekov commented May 19, 2025

Hi @kirklandsign,
Yes, when you pass nullptr and 0 as the base address / base address size for the U85 in Dedicated_Sram memory mode, the inference hangs, as you see in your CI trace; that behaviour is expected. Dedicated_Sram means that the NPU uses the fast scratch buffer to store the most commonly accessed intermediate tensors. With your fix, you are still testing Dedicated_Sram on the U85, but you are not providing a fast scratch array, hence the NPU hangs.

Can you please expand a bit more on what you mean when you say you use a different arm_executor_runner.cpp in your internal test suite: do you have a way to define ethosu_fast_scratch/ethosu_fast_scratch_size to be 0 or 384 KB depending on whether you test the U55 (Shared_Sram) or the U85 (Dedicated_Sram) in your internal arm_executor_runner.cpp? (See the sketch after this comment.)

The U85 is designed to enable NNs where the scratch buffer is too big to fit into the SRAM, so we place it in the DDR. To achieve good performance, it is important to still "dedicate" a small amount of SRAM where the NPU can read/write the most commonly accessed tensors of the NN. That behaviour is enabled by the Dedicated_Sram memory mode, so it is important that we support it properly.
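For illustration, a minimal sketch of the conditional definition the question above asks about, assuming a hypothetical CMake-defined macro MEMORY_MODE_DEDICATED_SRAM selects the memory mode; the ethosu_fast_scratch/ethosu_fast_scratch_size names and the 384 KB figure come from this thread, but the actual code in an internal runner may differ:

```cpp
#include <cstddef>
#include <cstdint>

#if defined(MEMORY_MODE_DEDICATED_SRAM)
// Dedicated_Sram (e.g. Ethos-U85): carve out SRAM that the NPU uses as a
// cache for the most commonly accessed intermediate tensors.
namespace {
uint8_t fast_scratch_storage[384 * 1024] __attribute__((aligned(16)));
} // namespace
uint8_t* ethosu_fast_scratch = fast_scratch_storage;
size_t ethosu_fast_scratch_size = sizeof(fast_scratch_storage);
#else
// Shared_Sram / Sram_Only (e.g. Ethos-U55): no fast scratch; passing
// nullptr/0 here is what the backend must tolerate instead of hanging.
uint8_t* ethosu_fast_scratch = nullptr;
size_t ethosu_fast_scratch_size = 0;
#endif
```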

gggekov added a commit to gggekov/executorch that referenced this pull request May 19, 2025
Temporary solution to the problem in pytorch#10958. The arm_executor_runner.cpp needs to declare the ethosu_fast_scratch array and pass it on to EthosUBackend.cpp. It is important that for Shared_Sram, ethosu_fast_scratch is nullptr, and for Dedicated_Sram it points to the fast memory array.

Change-Id: I808203fb7b9b6e5bece92c4cc5079f22bd802d95
@hsharma35
Contributor

@kirklandsign @gggekov Can these variables be added to the compile spec instead of declaring them in arm_executor_runner.cpp?

@kirklandsign
Contributor Author

@zingo @hsharma35 Sorry, I am not the best contact for this issue. I don't work on this area and just ran into an internal error. We could use @digantdesai's help.

zingo pushed a commit that referenced this pull request May 19, 2025
…Ethos-U85 (#10973)

Temporary solution to the problem in #10958. The arm_executor_runner.cpp needs to declare the ethosu_fast_scratch array and pass it on to EthosUBackend.cpp. It is important that for Shared_Sram, ethosu_fast_scratch is nullptr, and for Dedicated_Sram it points to the fast memory array.
hinriksnaer pushed a commit to hinriksnaer/executorch that referenced this pull request May 19, 2025
Differential Revision: D74939323

Pull Request resolved: pytorch#10958
hinriksnaer pushed a commit to hinriksnaer/executorch that referenced this pull request May 19, 2025
…Ethos-U85 (pytorch#10973)

Temporary solution to the problem in pytorch#10958. The arm_executor_runner.cpp needs to declare the ethosu_fast_scratch array and pass it on to EthosUBackend.cpp. It is important that for Shared_Sram, ethosu_fast_scratch is nullptr, and for Dedicated_Sram it points to the fast memory array.
@gggekov
Collaborator

gggekov commented May 21, 2025

Hi @hsharma35,

The EthosUBackend.cpp currently doesn't know whether the NN has been compiled for Shared_Sram, Dedicated_Sram, or another memory mode, but the arm_executor_runner.cpp knows that, thanks to the propagation of parameters in executorch/examples/arm/executor_runner/CMakeLists.txt. For a PTE generated for Shared_Sram or Sram_Only, we only need to pass two base pointers / base pointer sizes, but for Dedicated_Sram we need to pass a third base pointer towards the fast scratch array. Hence, I declared the fast scratch in the arm_executor_runner depending on the memory mode, and then passed the fast scratch (nullptr or a valid address) as an extern resolved at link time in EthosUBackend.cpp, as sketched below.
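As a rough illustration of that link-time coupling (not the exact code from the PR), the backend side could declare the runner-provided symbols like this; the trailing comment shows, schematically rather than with the real API, how the third base pointer would be filled in:

```cpp
#include <cstddef>
#include <cstdint>

// Defined by the runner (e.g. arm_executor_runner.cpp) and resolved when the
// final image is linked; the backend only declares them. This is exactly the
// coupling discussed in this thread: every runner linked against the backend
// now has to know about these symbols.
extern uint8_t* ethosu_fast_scratch;
extern size_t ethosu_fast_scratch_size;

// Schematic use for a Dedicated_Sram PTE (illustrative only):
//   bases[2] = reinterpret_cast<uintptr_t>(ethosu_fast_scratch);
//   sizes[2] = ethosu_fast_scratch_size;
```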

What's the best way to enable Dedicated_Sram in the runtime and make your internal CI pass?

CC @digantdesai @kirklandsign

@digantdesai
Contributor

> Can these variables be added to the compile spec instead of declaring them in arm_executor_runner.cpp?

@hsharma35 I am not sure compile_spec is the way to do this, especially if the delegate runtime wants to know the pointer at runtime based on the runner setup.

We have a flag for specifying the memory_mode at PTE generation time, the same one used by the CMake build, IIUC. But the scratch pointer and size information originate at runtime in the runner (through user-specified CMake knobs), and we can't easily pass them into delegate::init() from a runner.

Here @gggekov used externs, but I am not sure that's the right approach, because it leads to precisely these issues where a runner is now coupled to a delegate, even though they shouldn't know about each other, given that they sit in different abstraction layers.

We are working on et::backend_configs, which may help with this, but it is not ready yet. #10216

Let me think some more. For now, can we guard the delegate runtime variables so that they are backwards compatible with a runner that doesn't care about this memory_mode? @gggekov
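A minimal sketch of such a guard, reusing the extern symbols from the earlier sketch; the helper name is made up for illustration:

```cpp
#include <cstddef>
#include <cstdint>

extern uint8_t* ethosu_fast_scratch;     // provided by the runner
extern size_t ethosu_fast_scratch_size;  // provided by the runner

// Hypothetical guard: only pass the third base pointer when the runner has
// actually provided a fast scratch region, so a runner that ignores
// memory_mode keeps the old two-pointer behaviour. (Link-time presence of
// the symbols still has to be solved, e.g. via the weak symbols discussed
// below.)
inline bool has_dedicated_sram_scratch() {
  return ethosu_fast_scratch != nullptr && ethosu_fast_scratch_size != 0;
}
```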

@gggekov
Collaborator

gggekov commented May 27, 2025

Thanks @digantdesai, you are right that the extern couples the arm_executor_runner.cpp and EthosUBackend.cpp in a way that we should avoid.

How about we use a weak symbol for the fast scratch array, set to nullptr by default in EthosUBackend.cpp? That corresponds to Shared_Sram for the U55, U65, and U85. In the case of a U85 SoC where we want to use the DRAM (Dedicated_Sram), we override the weak symbol in arm_executor_runner.cpp with the correct array for the cache. I believe this should pass your internal CI, and it is close to a real system where, for Dedicated_Sram, the user needs to carve out a specified amount of memory for the NPU.
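A minimal two-file sketch of this weak-symbol scheme, assuming GCC/Clang __attribute__((weak)) syntax; the symbol names follow the thread and the 384 KB carve-out is the size mentioned earlier:

```cpp
#include <cstddef>
#include <cstdint>

// file: EthosUBackend.cpp -- weak defaults.
// nullptr/0 corresponds to Shared_Sram / Sram_Only; a runner that defines
// nothing still links and behaves exactly as before.
__attribute__((weak)) uint8_t* ethosu_fast_scratch = nullptr;
__attribute__((weak)) size_t ethosu_fast_scratch_size = 0;

// file: arm_executor_runner.cpp -- Dedicated_Sram builds only.
// Strong definitions override the weak defaults at link time, carving out
// the SRAM that the NPU uses as its cache.
namespace {
uint8_t fast_scratch_storage[384 * 1024] __attribute__((aligned(16)));
} // namespace
uint8_t* ethosu_fast_scratch = fast_scratch_storage;
size_t ethosu_fast_scratch_size = sizeof(fast_scratch_storage);
```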

Regarding #10216: in principle it may be useful, but I don't think we need a backend-specific configuration to enable the Dedicated_Sram mode on the U85. Also, note that in executorch/backends/arm/arm_vela.py, the Vela npz file contains the size of the scratch buffer and of the fast scratch buffer. For Shared_Sram/Sram_Only, the fast scratch buffer size is 0, and for Dedicated_Sram it is equal to the amount of SRAM that the NPU can utilise.

@digantdesai
Contributor

I think the weak symbol override is fine, as long as we don't tightly couple the delegate runtime and the app runner.

> I don't think we need a backend-specific configuration to enable the Dedicated_Sram mode on the U85

Generally speaking, I see this as a cleaner mechanism for passing information from the outside into a delegate, especially when it originates at runtime. For compile-time information we can use other mechanisms like compile_spec, preprocessor macros/variables, etc.

@gggekov
Collaborator

gggekov commented Jun 6, 2025

FYI, here is a PR (#11459) with the suggested weak-symbol fix.

Labels: ciflow/trunk, CLA Signed, fb-exported, module: arm, release notes: none, topic: not user facing