-
Notifications
You must be signed in to change notification settings - Fork 595
Arm backend: Remove fast scratch part for now #10958
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary: Fix CI Differential Revision: D74939323
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10958
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New FailureAs of commit 116eaf2 with merge base 9aaea31 ( NEW FAILURE - The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This pull request was exported from Phabricator. Differential Revision: D74939323 |
Hi, sorry for the break, from our (Arm) point just do what is easier for you to just make it work. Revert or this fix and we will look into this when back to work next week. To understand the problem you see better and if you know, is it so that you have a different runner then our example runner you are testing with? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, to just make it work for you.
Seems that our internal CI has a different runner than the example runner. |
@kirklandsign has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
** > Seems that our internal CI has a different runner than the example runner.** Thank that makes sense as you get this problem, and something we will try to keep in mind. |
Regarding the failed Arm unit model test it is unrelated an a fix for it is (hopefully) here |
Hi @zingo seems that the test-arm-backend (test_models_ethos-u85) / linux-job always failed after this patch. https://github.com/pytorch/executorch/actions/runs/15101696584/job/42443518087 Could you please suggest how to fix? |
Hi @kirklandsign, Can you please expand a bit more on what you mean when you say you use a different arm_executor_runner.cpp in your internal test suite - do you have a way to define the ethosu_fast_scratch/ethosu_fast_scratch_size to be 0 or 384KB depending on if we test U55(Shared_Sram) or U85(Dedicated_Sram) in your internal arm_executor_runner.cpp? The U85 is designed to enable NNs where the scratch buffer is too big to fit into the SRAM and hence we place it in the DDR. In order to achieve good performance, it is important to still "dedicate" small amount of SRAM where the NPU can R/W the most commonly accessed tensors of the NN. That behaviour is enabled by the Dedicated_Sram memory mode, so it's important we support it properly. |
Temporary solution to the problem in pytorch#10958 The arm_executor_runner.cpp need to declare the ethosu_fast_scratch array and pass it onto to the EthosUBackend.cpp. It is important that for Shared_Sram, the ethosu_fast_scratch is nullptr and for Dedicated_Sram it points to the fast memory array. Change-Id: I808203fb7b9b6e5bece92c4cc5079f22bd802d95
@kirklandsign @gggekov Can these variable be added to compile spec instead or declaring them in |
@zingo @hsharma35 Sorry I am not the best contact for this issue. I don't work on this and just ran into an internal error. Could use @digantdesai 's help |
…Ethos-U85 (#10973) Temporary solution to the problem in #10958 The arm_executor_runner.cpp need to declare the ethosu_fast_scratch array and pass it onto to the EthosUBackend.cpp. It is important that for Shared_Sram, the ethosu_fast_scratch is nullptr and for Dedicated_Sram it points to the fast memory array.
Differential Revision: D74939323 Pull Request resolved: pytorch#10958
…Ethos-U85 (pytorch#10973) Temporary solution to the problem in pytorch#10958 The arm_executor_runner.cpp need to declare the ethosu_fast_scratch array and pass it onto to the EthosUBackend.cpp. It is important that for Shared_Sram, the ethosu_fast_scratch is nullptr and for Dedicated_Sram it points to the fast memory array.
Hi @hsharma35, The EthosUBackend.cpp currently doesn't know if the NN has been compiled for Shared_Sram or Dedicated_Sram or another memory mode, but the arm_executor_runner.cpp knows that thanks to the propagation of parameters in the executorch/examples/arm/executor_runner/CMakeLists.txt. For pte generated for Shared_Sram & Sram_Only, we only need to pass 2 base pointers/base pointer sizes, but for Dedicated_Sram we need to pass a third base pointer towards the fast scratch array, hence I declared the fast scratch in the arm_executor_runner depending on the memory mode. Then, i passed the fast scratch(nullptr or valid address) as an extern at link time to the EthosUBackend.cpp What's the best way to enable dedicated_sram in the runtime & make your internal CI pass ? |
@hsharma35 not sure if compile_spec is the way to do this, esp if the delegate runtime wants to know the pointer at runtime based on the runner setup. We have a flag for specifying the memory_mode from PTE generation time, same used by the CMake IIUC. But for the scratch ptr, size info originates at runtime in the runner (through user specified cmake knobs), and we can't easily pass it in the delegate::init() from a runner. Here @gggekov used We are working on Let me think some more. For now can we guard the delegate runtime variables such that it is backwards compatible for some runner which doesn't care about this memory_mode? @gggekov |
Thanks @digantdesai , you are right the extern couples the arm_executor_runner.cpp & EthosUBackend.cpp in a way that we should avoid. How about if we use a weak symbol for the fast scratch array, by default set to it nullptr in the EthosUBackend.cpp? That corresponds to Shared_Sram for U55,65 & U85. In case of a U85 SoC where we want to use the DRAM(Dedicated_Sram), overwrite the weak symbol in the arm_executor_runner.cpp with the correct array for the cache. I believe this should pass your internal CI and is close to a real system where for Dedicated_Sram, the user needs to carve out specified amount of memory for the NPU. Regarding #10216, in principle it may be useful, but I don't think we need a backend specific configuration to enable the Dedicated_Sram mode on the U85. Also, note that in executorch/backends/arm/arm_vela.py, in the vela npz file you have the size of the scratch buffer and the fast scratch buffer. For Shared_Sram/Sram_Only, the fast_scratch_buffer is 0 and for Dedicated_Sram, it is equal to the amount of SRAM that the NPU can utilise. |
I think weak symbol override is fine. As long as we don't tightly couple the delegate runtime vs. app runner.
Generally speaking, I see this mechanism as a cleaner mechanism to pass information from outside to a delegate especially when it originates at runtime. For compile-time information we can use other mechanisms like compile_spec, preprocessor macros/variables, etc. |
FYI - here is a pr(#11459) with the suggested fix with the weak symbol |
Summary: Fix CI
Differential Revision: D74939323