-
Hello, I understand the team is busy, so no rush, but I was hoping someone could share what the timeline for this feature looks like. I noticed that Qualcomm already supports this for the NPU through linked binaries; however, I had issues deploying my models with that approach and opted for ExecuTorch instead. Huge kudos to everyone on the team for developing this simple and performant framework!
-
🚀 The feature, motivation and pitch
Problem
In ExecuTorch today, models with multiple methods (e.g. prefill and decode) are exported as separate graphs. When lowering to a specific backend, each graph is lowered in isolation, without awareness or context of the other graphs being lowered to the same backend. The problem arises when these separate graphs have shared components. In a Llama model with prefill and decode methods, the linear layers in each method share the same weights and biases. Because the prefill and decode graphs are lowered separately, those shared weights and biases are copied and serialized twice, once into each backend payload. This results in model bloat from duplicated weights, which is a limiting factor when bringing models to production, especially on memory-constrained devices.
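For illustration, here is a minimal sketch of the duplication using plain `torch.export` (not the full ExecuTorch lowering flow). The `TinyModel`, `Prefill`, and `Decode` wrappers are assumptions made up for this example:

```python
import torch
from torch.export import export

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)  # shared by prefill and decode

class Prefill(torch.nn.Module):
    def __init__(self, m):
        super().__init__()
        self.m = m
    def forward(self, x):
        return self.m.linear(x)

class Decode(torch.nn.Module):
    def __init__(self, m):
        super().__init__()
        self.m = m
    def forward(self, x):
        return self.m.linear(x)

m = TinyModel()
# Each method is exported as its own graph, in isolation.
prefill_ep = export(Prefill(m), (torch.randn(8, 4),))
decode_ep = export(Decode(m), (torch.randn(1, 4),))

# Each ExportedProgram now carries its own copy of linear.weight and
# linear.bias, so a backend lowering them independently serializes the
# shared weights twice.
```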
Requirements
Goals
Non-Goals
Design
We propose new ahead-of-time APIs that provide backends with all of the graphs, across partitions and methods, that are being lowered. This enables backends to identify components shared across these graphs. Additionally, we provide a blob storage service that backends can use to serialize data shared across graphs; at runtime, backends can retrieve the shared data for any further initialization. The design details are fleshed out in the Blob Storage Service proposal (#8187); see the sections 'AoT: Preprocess' and 'Runtime: NamedDataMap'. A sketch of the core idea follows below.
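To make the idea concrete, here is a hypothetical sketch of a multi-graph preprocess backed by content-addressed blob storage. The names (`BlobStore`, `preprocess_all`) and the toy graph representation are illustrative assumptions, not the actual ExecuTorch API; the real interfaces are specified in #8187:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class BlobStore:
    """Toy stand-in for the proposed blob storage service."""
    blobs: dict = field(default_factory=dict)  # content hash -> bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)  # identical data is stored once
        return key

def preprocess_all(method_graphs: dict, store: BlobStore) -> dict:
    """Lower every method's graph together, so shared weights are visible.

    `method_graphs` maps a method name (e.g. "prefill") to a toy graph
    with `ops` and named `constants` (raw weight bytes).
    """
    payloads = {}
    for method, graph in method_graphs.items():
        refs = []
        for name, tensor_bytes in graph["constants"].items():
            # Identical weights across prefill/decode hash to the same
            # key, so the bytes land in the store exactly once and each
            # payload only holds a reference.
            refs.append((name, store.put(tensor_bytes)))
        payloads[method] = {"graph": graph["ops"], "constant_refs": refs}
    return payloads

# Usage: prefill and decode share the same weight bytes.
w = b"\x00" * 64
store = BlobStore()
payloads = preprocess_all(
    {
        "prefill": {"ops": ["linear"], "constants": {"w": w}},
        "decode": {"ops": ["linear"], "constants": {"w": w}},
    },
    store,
)
assert len(store.blobs) == 1  # weights serialized once, referenced twice
```

At runtime, under this sketch, each method's payload would resolve its constant references against the shared store (the role played by the NamedDataMap in the proposal), so both methods initialize from a single copy of the weights.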
cc. @lucylq @cccclai