
Commit 633913c

Support for Quark quantization. Update lemonade getting started. (#290)
1 parent b805839 commit 633913c

16 files changed: +1098 -600 lines changed

.github/workflows/test_quark.yml

Lines changed: 52 additions & 0 deletions

@@ -0,0 +1,52 @@
+# This workflow will install Python dependencies, run tests and lint with a single version of Python
+# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
+
+name: Test Lemonade with Quark Quantization
+
+on:
+  push:
+    branches: ["main"]
+  pull_request:
+    branches: ["main"]
+
+permissions:
+  contents: read
+
+jobs:
+  make-quark-lemonade:
+    env:
+      LEMONADE_CI_MODE: "True"
+    runs-on: windows-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Miniconda with 64-bit Python
+        uses: conda-incubator/setup-miniconda@v2
+        with:
+          miniconda-version: "latest"
+          activate-environment: lemon
+          python-version: "3.10"
+          run-post: "false"
+      - name: Install dependencies
+        shell: bash -el {0}
+        run: |
+          python -m pip install --upgrade pip
+          conda install pylint
+          python -m pip check
+          pip install -e .[llm-oga-cpu]
+          lemonade-install --quark 0.6.0
+      - name: Lint with Black
+        uses: psf/black@stable
+        with:
+          options: "--check --verbose"
+          src: "./src"
+      - name: Lint with PyLint
+        shell: bash -el {0}
+        run: |
+          pylint src/lemonade/tools/quark --rcfile .pylintrc --disable E0401
+      - name: Run lemonade tests
+        shell: bash -el {0}
+        env:
+          HF_TOKEN: "${{ secrets.HUGGINGFACE_ACCESS_TOKEN }}" # Required by OGA model_builder in OGA 0.4.0 but not future versions
+        run: |
+          python test/lemonade/quark_api.py
+
docs/lemonade/getting_started.md

Lines changed: 152 additions & 67 deletions

@@ -1,54 +1,125 @@
 # Lemonade
 
 Welcome to the project page for `lemonade` the Turnkey LLM Aide!
-Contents:
 
-1. [Getting Started](#getting-started)
-1. [Install Specialized Tools](#install-specialized-tools)
-   - [OnnxRuntime GenAI](#install-onnxruntime-genai)
-   - [RyzenAI NPU for PyTorch](#install-ryzenai-npu-for-pytorch)
+1. [Install](#install)
+1. [CLI Commands](#cli-commands)
+   - [Syntax](#syntax)
+   - [Chatting](#chatting)
+   - [Accuracy](#accuracy)
+   - [Benchmarking](#benchmarking)
+   - [Memory Usage](#memory-usage)
+   - [Serving](#serving)
+1. [API Overview](#api)
 1. [Code Organization](#code-organization)
 1. [Contributing](#contributing)
 
-# Getting Started
 
-`lemonade` introduces a brand new set of LLM-focused tools.
+# Install
 
-## Install
+You can quickly get started with `lemonade` by installing the `turnkeyml` [PyPI package](#from-pypi) with the appropriate extras for your backend, or you can [install from source](#from-source-code) by cloning and installing this repository.
+
+## From PyPI
+
+To install `lemonade` from PyPI:
+
+1. Create and activate a [miniconda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe) environment.
+   ```bash
+   conda create -n lemon python=3.10
+   conda activate lemon
+   ```
+
+1. Install lemonade for your backend of choice:
+   - [OnnxRuntime GenAI with CPU backend](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md):
+     ```bash
+     pip install turnkeyml[llm-oga-cpu]
+     ```
+   - [OnnxRuntime GenAI with Integrated GPU (iGPU, DirectML) backend](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md):
+     > Note: Requires Windows and a DirectML-compatible iGPU.
+     ```bash
+     pip install turnkeyml[llm-oga-igpu]
+     ```
+   - OnnxRuntime GenAI with Ryzen AI Hybrid (NPU + iGPU) backend:
+     > Note: Ryzen AI Hybrid requires a Windows 11 PC with an AMD Ryzen™ AI 9 HX375, Ryzen AI 9 HX370, or Ryzen AI 9 365 processor.
+     > - Install the [Ryzen AI driver >= 32.0.203.237](https://ryzenai.docs.amd.com/en/latest/inst.html#install-npu-drivers) (you can check your driver version under Device Manager > Neural Processors).
+     > - Visit the [AMD Hugging Face page](https://huggingface.co/collections/amd/quark-awq-g128-int4-asym-fp16-onnx-hybrid-13-674b307d2ffa21dd68fa41d5) for supported checkpoints.
+     ```bash
+     pip install turnkeyml[llm-oga-hybrid]
+     lemonade-install --ryzenai hybrid
+     ```
+   - Hugging Face (PyTorch) LLMs for CPU backend:
+     ```bash
+     pip install turnkeyml[llm]
+     ```
+   - llama.cpp: see [instructions](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/llamacpp.md).
+
+1. Use `lemonade -h` to explore the LLM tools, and see the [command](#cli-commands) and [API](#api) examples below.
+
+
+## From Source Code
+
+To install `lemonade` from source code:
 
 1. Clone: `git clone https://github.com/onnx/turnkeyml.git`
-1. `cd turnkeyml` (where `turnkeyml` is the repo root of your TurnkeyML clone)
+1. `cd turnkeyml` (where `turnkeyml` is the repo root of your clone)
    - Note: be sure to run these installation instructions from the repo root.
-1. Create and activate a conda environment:
-   1. `conda create -n lemon python=3.10`
-   1. `conda activate lemon`
-1. Install lemonade: `pip install -e .[llm]`
-   - or `pip install -e .[llm-oga-igpu]` if you want to use `onnxruntime-genai` (see [OGA](#install-onnxruntime-genai))
-1. `lemonade -h` to explore the LLM tools
+1. Follow the same instructions as in the [PyPI installation](#from-pypi), except replace `turnkeyml` with `-e .` to get an editable install of your clone.
+   - For example: `pip install -e .[llm-oga-igpu]`
+
+# CLI Commands
+
+The `lemonade` CLI uses a unique command syntax that enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options.
+
+Each unit of functionality (e.g., loading a model, running a test, deploying a server, etc.) is called a `Tool`, and a single call to `lemonade` can invoke any number of `Tools`. Each `Tool` will perform its functionality, then pass its state to the next `Tool` in the command.
+
+You can read each command out loud to understand what it is doing. For example, a command like this:
+
+```bash
+lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
+```
+
+Can be read like this:
 
-## Syntax
+> Run `lemonade` on the input `(-i)` checkpoint `microsoft/Phi-3-mini-4k-instruct`. First, load it in the OnnxRuntime GenAI framework (`oga-load`), onto the integrated GPU device (`--device igpu`) in the int4 data type (`--dtype int4`). Then, pass the OGA model to the prompting tool (`llm-prompt`) with the prompt (`-p`) "Hello, my thoughts are" and print the response.
+
+The `lemonade -h` command will show you which options and Tools are available, and `lemonade TOOL -h` will tell you more about that specific Tool.
 
-The `lemonade` CLI uses the same style of syntax as `turnkey`, but with a new set of LLM-specific tools. You can read about that syntax [here](https://github.com/onnx/turnkeyml#how-it-works).
 
 ## Chatting
 
 To chat with your LLM try:
 
-`lemonade -i facebook/opt-125m huggingface-load llm-prompt -p "Hello, my thoughts are"`
+OGA iGPU:
+```bash
+lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
+```
+
+Hugging Face:
+```bash
+lemonade -i facebook/opt-125m huggingface-load llm-prompt -p "Hello, my thoughts are"
+```
 
-The LLM will run on CPU with your provided prompt, and the LLM's response to your prompt will be printed to the screen. You can replace the `"Hello, my thoughts are"` with any prompt you like.
+The LLM will run with your provided prompt, and the LLM's response will be printed to the screen. You can replace `"Hello, my thoughts are"` with any prompt you like.
 
-You can also replace the `facebook/opt-125m` with any Huggingface checkpoint you like, including LLaMA-2, Phi-2, Qwen, Mamba, etc.
+You can also replace `facebook/opt-125m` with any Hugging Face checkpoint you like, including LLaMA-2, Phi-2, Qwen, Mamba, etc.
 
-You can also set the `--device` argument in `huggingface-load` to load your LLM on a different device.
+You can also set the `--device` argument in `oga-load` and `huggingface-load` to load your LLM on a different device.
 
 Run `lemonade huggingface-load -h` and `lemonade llm-prompt -h` to learn more about those tools.
 
 ## Accuracy
 
 To measure the accuracy of an LLM using MMLU, try this:
 
-`lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management`
+OGA iGPU:
+```bash
+lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 accuracy-mmlu --tests management
+```
+
+Hugging Face:
+```bash
+lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management
+```
 
 That command will run just the management test from MMLU on your LLM and save the score to the lemonade cache at `~/.cache/lemonade`.
 
@@ -58,18 +129,34 @@ You can run the full suite of MMLU subjects by omitting the `--test` argument. Y
 
 To measure the time-to-first-token and tokens/second of an LLM, try this:
 
-`lemonade -i facebook/opt-125m huggingface-load huggingface-bench`
+OGA iGPU:
+```bash
+lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench
+```
+
+Hugging Face:
+```bash
+lemonade -i facebook/opt-125m huggingface-load huggingface-bench
+```
 
 That command will run a few warmup iterations, then a few generation iterations where performance data is collected.
 
-The prompt size, number of output tokens, and number iterations are all parameters. Learn more by running `lemonade huggingface-bench -h`.
+The prompt size, number of output tokens, and number of iterations are all parameters. Learn more by running `lemonade oga-bench -h` or `lemonade huggingface-bench -h`.
 
 ## Memory Usage
 
-The peak memory used by the lemonade build is captured in the build output. To capture more granular
+The peak memory used by the `lemonade` build is captured in the build output. To capture more granular
 memory usage information, use the `--memory` flag. For example:
 
-`lemonade -i facebook/opt-125m --memory huggingface-load huggingface-bench`
+OGA iGPU:
+```bash
+lemonade --memory -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench
+```
+
+Hugging Face:
+```bash
+lemonade --memory -i facebook/opt-125m huggingface-load huggingface-bench
+```
 
 In this case a `memory_usage.png` file will be generated and stored in the build folder. This file
 contains a figure plotting the memory usage over the build time. Learn more by running `lemonade -h`.
@@ -78,70 +165,66 @@ contains a figure plotting the memory usage over the build time. Learn more by
 
 You can launch a WebSocket server for your LLM with:
 
-`lemonade -i facebook/opt-125m huggingface-load serve`
-
-Once the server has launched, you can connect to it from your own application, or interact directly by following the on-screen instructions to open a basic web app.
-
-Note that the `llm-prompt`, `accuracy-mmlu`, and `serve` tools can all be used with other model-loading tools, for example `onnxruntime-genai` or `ryzenai-transformers`. See [Install Specialized Tools](#install-specialized-tools) for details.
-
-## API
-
-Lemonade is also available via API. Here's a quick example of how to benchmark an LLM:
-
-```python
-import lemonade.tools.torch_llm as tl
-import lemonade.tools.chat as cl
-from turnkeyml.state import State
-
-state = State(cache_dir="cache", build_name="test")
-
-state = tl.HuggingfaceLoad().run(state, input="facebook/opt-125m")
-state = cl.Prompt().run(state, prompt="hi", max_new_tokens=15)
+OGA iGPU:
+```bash
+lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 serve
+```
 
-print("Response:", state.response)
+Hugging Face:
+```bash
+lemonade -i facebook/opt-125m huggingface-load serve
 ```
 
-# Install Specialized Tools
+Once the server has launched, you can connect to it from your own application, or interact directly by following the on-screen instructions to open a basic web app.
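
For readers who want to connect to the new `serve` examples from their own application, here is a minimal WebSocket client sketch in Python. It is not part of this commit: the `websockets` package and the `ws://localhost:8000` address are assumptions for illustration, so use the address the server prints when it launches.

```python
# Hypothetical client sketch; the endpoint URL is an assumption --
# substitute the address printed when `lemonade ... serve` starts.
import asyncio

import websockets  # third-party package: pip install websockets


async def chat() -> None:
    async with websockets.connect("ws://localhost:8000") as ws:
        await ws.send("Hello, my thoughts are")  # send the prompt text
        print(await ws.recv())  # print the server's reply


asyncio.run(chat())
```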
 
-Lemonade supports specialized tools that each require their own setup steps. **Note:** These tools will only appear in `lemonade -h` if you run in an environment that has completed setup.
+# API
 
-## Install OnnxRuntime-GenAI
+Lemonade is also available via API.
 
-To install support for [onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai), use `pip install -e .[llm-oga-igpu]` instead of the default installation command.
+## LEAP APIs
 
-You can then load supported OGA models on to CPU or iGPU with the `oga-load` tool, for example:
+The lemonade enablement platform (LEAP) API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid). This makes it easy to integrate lemonade LLMs into Python applications.
 
-`lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"`
+OGA iGPU:
+```python
+from lemonade import leap
 
-You can also launch a server process with:
+model, tokenizer = leap.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", recipe="oga-igpu")
 
-The `oga-bench` tool is available to capture tokens/second and time-to-first-token metrics: `lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench`. Learn more with `lemonade oga-bench -h`.
+input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
+response = model.generate(input_ids, max_new_tokens=30)
 
-You can also try Phi-3-Mini-128k-Instruct with the following commands:
+print(tokenizer.decode(response[0]))
+```
 
-`lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 serve`
+You can learn more about the LEAP APIs [here](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade).
 
-You can learn more about the CPU and iGPU support in our [OGA documentation](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md).
+## Low-Level API
 
-> Note: early access to AMD's RyzenAI NPU is also available. See the [RyzenAI NPU OGA documentation](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_npu.md) for more information.
+The low-level API is useful for designing custom experiments. For example, sweeping over specific checkpoints, devices, and/or tools.
 
-## Install RyzenAI NPU for PyTorch
+Here's a quick example of how to prompt a Hugging Face LLM using the low-level API, which calls the load and prompt tools one by one:
 
-To run your LLMs on RyzenAI NPU, first install and set up the `ryzenai-transformers` conda environment (see instructions [here](https://github.com/amd/RyzenAI-SW/blob/main/example/transformers/models/llm/docs/README.md)). Then, install `lemonade` into `ryzenai-transformers`. The `ryzenai-npu-load` Tool will become available in that environment.
+```python
+import lemonade.tools.torch_llm as tl
+import lemonade.tools.chat as cl
+from turnkeyml.state import State
 
-You can try it out with: `lemonade -i meta-llama/Llama-2-7b-chat-hf ryzenai-npu-load --device DEVICE llm-prompt -p "Hello, my thoughts are"`
+state = State(cache_dir="cache", build_name="test")
 
-Where `DEVICE` is either "phx" or "stx" if you have a RyzenAI 7xxx/8xxx or 3xx/9xxx processor, respectively.
+state = tl.HuggingfaceLoad().run(state, input="facebook/opt-125m")
+state = cl.Prompt().run(state, prompt="hi", max_new_tokens=15)
 
-> Note: only `meta-llama/Llama-2-7b-chat-hf` and `microsoft/Phi-3-mini-4k-instruct` are supported by `lemonade` at this time. Contributions appreciated!
+print("Response:", state.response)
+```
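
Since each `Tool` simply consumes and returns a `state` object, the low-level example above extends naturally to the checkpoint sweeps this section mentions. A minimal sketch, using only the classes shown in the diff (the second checkpoint is an illustrative choice):

```python
# Sweep one prompt across several Hugging Face checkpoints with the
# low-level API; the checkpoint list is illustrative.
import lemonade.tools.torch_llm as tl
import lemonade.tools.chat as cl
from turnkeyml.state import State

for checkpoint in ["facebook/opt-125m", "facebook/opt-350m"]:
    # A separate State (and build name) per checkpoint keeps builds distinct
    state = State(cache_dir="cache", build_name=checkpoint.replace("/", "_"))
    state = tl.HuggingfaceLoad().run(state, input=checkpoint)
    state = cl.Prompt().run(state, prompt="hi", max_new_tokens=15)
    print(checkpoint, "->", state.response)
```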
 
 # Contributing
 
-If you decide to contribute, please:
+Contributions are welcome! If you decide to contribute, please:
 
-- do so via a pull request.
-- write your code in keeping with the same style as the rest of this repo's code.
-- add a test under `test/lemonade/llm_api.py` that provides coverage of your new feature.
+- Do so via a pull request.
+- Write your code in keeping with the same style as the rest of this repo's code.
+- Add a test under `test/lemonade` that provides coverage of your new feature.
 
 The best way to contribute is to add new tools to cover more devices and usage scenarios.
 
@@ -150,3 +233,5 @@ To add a new tool:
 1. (Optional) Create a new `.py` file under `src/lemonade/tools` (or use an existing file if your tool fits into a pre-existing family of tools).
 1. Define a new class that inherits the `Tool` class from `TurnkeyML`.
 1. Register the class by adding it to the list of `tools` near the top of `src/lemonade/cli.py`.
+
+You can learn more about contributing on the repository's [contribution guide](https://github.com/onnx/turnkeyml/blob/main/docs/contribute.md).
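
To make those three steps concrete, here is a hypothetical skeleton of a new `Tool`. The import path, `unique_name` attribute, and `run` signature are assumptions inferred from the low-level API example above, not the actual TurnkeyML interface; mirror an existing file under `src/lemonade/tools` for the real one.

```python
# Hypothetical Tool skeleton -- names and signatures are assumptions;
# copy the structure of an existing tool in src/lemonade/tools instead.
from turnkeyml.tools import Tool  # assumed import path
from turnkeyml.state import State


class MyNewTool(Tool):
    unique_name = "my-new-tool"  # assumed: the name users type on the CLI

    def run(self, state: State, **kwargs) -> State:
        # Transform state here (load a model, run a measurement, etc.),
        # then return it so the next Tool in the command can consume it.
        return state
```

After defining the class, register it in the `tools` list in `src/lemonade/cli.py` as described in step 3.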
