Note to self: sopath is missing in the current version. Copy the reported path to ./${MODEL_REPO}.so
When you have exported the model, you can test it with the sequence generator by loading the compiled DSO model with the `--sopath ./{modelname}.so` option. This gives users the ability to test their model, run any pre-existing model tests against the exported model with the same interface, and run additional experiments to confirm model quality and speed.

```
python generate.py --device {cuda,cpu} --dso ./${MODEL_REPO}.so --prompt "Hello my name is"
```

Note to self: --dso does not currently take an argument, and always loads stories15M.

Use a small model like stories15M.pt to test the instructions in the following section. You must first have ExecuTorch installed before running this command; see the installation instructions [here](#installation-instructions).
The environment variable MODEL_REPO should point to a directory with the `model.pth` file and `tokenizer.model` file.
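
For example, a minimal sketch (the `checkpoints/` layout is an assumption taken from the generate.py commands below, and stories15M is just the small test model mentioned above):

```
export MODEL_REPO=stories15M
ls checkpoints/$MODEL_REPO
# expected contents: model.pth  tokenizer.model
```
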
The command below will add the file "${MODEL_REPO}.pte" to your current directory.
TODO(fix this): the export command works with the "--xnnpack" flag, but the next generate.py command will not run it, so we do not set it right now.
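
As a hedged sketch only, an export invocation along the following lines would produce the `.pte` file; the script name and flag names here are assumptions rather than the confirmed export interface:

```
# Hypothetical sketch; the actual export script and its flags may differ.
python export.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --output-pte-path ${MODEL_REPO}.pte
```
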
When you have exported the model, you can test it with the sequence generator by loading the exported ExecuTorch model with the `--ptepath ./{modelname}.pte` option. This gives users the ability to test their model, run any pre-existing model tests against the exported model with the same interface, and run additional experiments to confirm model quality and speed.

To run the exported pte file, use the command below. Note that this is very slow at the moment.

```
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --pte ${MODEL_REPO}.pte --prompt "Hello my name is" --device cpu
```
But *that requires xnnpack to work in Python!*
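
One hedged way to check whether the ExecuTorch Python bindings are importable at all (the module path is an assumption and may vary by ExecuTorch version; whether XNNPACK ops are included depends on how the bindings were built):

```
# Hedged check: attempt to import the ExecuTorch Python bindings.
# The module path is an assumption and may differ across ExecuTorch versions;
# XNNPACK support additionally depends on how the bindings were built.
try:
    from executorch.extension.pybindings.portable_lib import _load_for_executorch  # noqa: F401
    print("ExecuTorch Python bindings are importable")
except ImportError as err:
    print("ExecuTorch Python bindings not available:", err)
```
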
We need to fit models into the memory of a mobile device and optimize execution speed -- both using quantization.

The simplest way to quantize is with int8 quantization, where each value is represented by an 8 bit integer, and a floating point scale.

Now you can run your model with the same command as before:
```
python generate.py --pte ${MODEL_REPO}_int8.pte --prompt "Hello my name is"
```
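
To make the int8 scheme concrete, here is an illustration-only sketch (not the project's quantization code) of symmetric per-tensor int8 quantization with a single floating point scale:

```
import torch

def quantize_int8(x: torch.Tensor):
    # Symmetric int8 quantization: one floating point scale for the whole tensor.
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Multiply back by the scale to recover an approximation of the original values.
    return q.to(torch.float32) * scale

x = torch.randn(4, 8)
q, scale = quantize_int8(x)
print((x - dequantize_int8(q, scale)).abs().max())  # small quantization error
```
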
#### 4 bit integer quantization (8da4w)
TBD.
# Standalone Execution
## Desktop and Server Execution
This has been tested with Linux and x86 (using CPU ~and GPU~), and MacOS and ARM/Apple Silicon.
The runner-* directories show how to integrate AOTI- and ET-exported models into a C/C++ application when no Python environment is available. Integrate them into your own applications and adapt them to your application and model needs!
Build the runner, then run it (assuming you already generated the tokenizer.bin file).

Check out the [tutorial on how to build an Android app running your PyTorch models with Executorch](https://pytorch.org/executorch/main/llm/llama-demo-android.html), and give your llama-fast models a spin.

Open the iOS Llama Xcode project at https://github.com/pytorch/executorch/tree/main/examples/demo-apps/apple_ios/LLaMA/LLaMA.xcodeproj in Xcode and click Run.
You will need to provide a provisioning profile (similar to what's expected for any iOS dev).

Once you can run the app on your device:
1 - connect the device to your Mac,
2 - copy the model and tokenizer.bin to the iOS Llama app
3 - select the tokenizer and model with the `(...)` control (bottom left of screen, to the left of the text entry box)
# Supported Systems
PyTorch and the mobile Executorch backend support a broad range of devices for running PyTorch with Python (using either eager or eager + torch.compile), or using a Python-free environment with AOT Inductor, as well as runtimes for executing exported models.