
Commit 0a58235

Update version to 5.0.1 (#265)
* update version
* update workflow
* update docs
* Update notebook naming
* put the notebook in the right folder
* Fix llamacpp. Polish naming.

---------

Co-authored-by: Jeremy Fowers <jeremy.fowers@amd.com>
1 parent 4e7450d commit 0a58235

File tree

14 files changed: +128, -114 lines


.github/workflows/test_lemonade.yml

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ jobs:
       - name: Lint with PyLint
         shell: bash -el {0}
         run: |
-          pylint src/turnkeyml/llm --rcfile .pylintrc --disable E0401
+          pylint src/lemonade --rcfile .pylintrc --disable E0401
       - name: Test HF+CPU server
         if: runner.os == 'Windows'
         timeout-minutes: 10

README.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # Welcome to ONNX TurnkeyML

 [![Turnkey tests](https://github.com/onnx/turnkeyml/actions/workflows/test_turnkey.yml/badge.svg)](https://github.com/onnx/turnkeyml/tree/main/test "Check out our tests")
-[![Turnkey-LLM tests](https://github.com/onnx/turnkeyml/actions/workflows/test_lemonade.yml/badge.svg)](https://github.com/onnx/turnkeyml/tree/main/test "Check out our tests")
+[![Lemonade tests](https://github.com/onnx/turnkeyml/actions/workflows/test_lemonade.yml/badge.svg)](https://github.com/onnx/turnkeyml/tree/main/test "Check out our tests")
 [![OS - Windows | Linux](https://img.shields.io/badge/OS-windows%20%7C%20linux-blue)](https://github.com/onnx/turnkeyml/blob/main/docs/install.md "Check out our instructions")
 [![Made with Python](https://img.shields.io/badge/Python-3.8,3.10-blue?logo=python&logoColor=white)](https://github.com/onnx/turnkeyml/blob/main/docs/install.md "Check out our instructions")

docs/lemonade_getting_started.md

Lines changed: 7 additions & 7 deletions
@@ -1,6 +1,6 @@
-# Turnkey-LLM
+# Lemonade

-Welcome to the project page for `turnkey-llm` (aka, "lemonade" the turnkey LLM Aide)!
+Welcome to the project page for `lemonade` the Turnkey LLM Aide!
 Contents:

 1. [Getting Started](#getting-started)
@@ -12,16 +12,16 @@ Contents:

 # Getting Started

-`turnkey-llm` introduces a brand new set of LLM-focused tools.
+`lemonade` introduces a brand new set of LLM-focused tools.

 ## Install

 1. Clone: `git clone https://github.com/onnx/turnkeyml.git`
 1. `cd turnkeyml` (where `turnkeyml` is the repo root of your TurnkeyML clone)
     - Note: be sure to run these installation instructions from the repo root.
 1. Create and activate a conda environment:
-    1. `conda create -n tk-llm python=3.10`
-    1. `conda activate tk-llm`
+    1. `conda create -n lemon python=3.10`
+    1. `conda activate lemon`
 1. Install lemonade: `pip install -e .[llm]`
     - or `pip install -e .[llm-oga-dml]` if you want to use `onnxruntime-genai` (see [OGA](#install-onnxruntime-genai))
 1. `lemonade -h` to explore the LLM tools
@@ -137,6 +137,6 @@ The best way to contribute is to add new tools to cover more devices and usage s

 To add a new tool:

-1. (Optional) Create a new `.py` file under `src/turnkeyml/llm/tools` (or use an existing file if your tool fits into a pre-existing family of tools).
+1. (Optional) Create a new `.py` file under `src/lemonade/tools` (or use an existing file if your tool fits into a pre-existing family of tools).
 1. Define a new class that inherits the `Tool` class from `TurnkeyML`.
-1. Register the class by adding it to the list of `tools` near the top of `src/turnkeyml/llm/cli.py`.
+1. Register the class by adding it to the list of `tools` near the top of `src/lemonade/cli.py`.

docs/llamacpp.md

Lines changed: 1 addition & 0 deletions
@@ -124,3 +124,4 @@ The integration provides:
 - Proper error handling with detailed messages
 - Performance metrics collection
 - Configurable generation parameters (temperature, top_p, top_k)
+- 10-minute timeout for model generation to prevent indefinite hangs
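
The 10-minute timeout documented here corresponds to a subprocess-level timeout around the llama.cpp CLI call. A minimal sketch of that pattern follows; the executable path, flags, and helper name are illustrative, not lemonade's exact API:

```python
# Hedged sketch: enforcing a generation timeout around a llama.cpp-style CLI call.
# The binary path and flags are placeholders, not the exact ones lemonade uses.
import subprocess

def run_llamacpp(executable: str, model_path: str, prompt: str, timeout_s: int = 600) -> str:
    cmd = [executable, "-m", model_path, "-p", prompt]
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        encoding="utf-8",
        errors="replace",
    )
    try:
        # Raise if generation takes longer than timeout_s seconds
        stdout, stderr = process.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        process.kill()  # reap the hung process before surfacing the error
        raise Exception(f"llama.cpp generation exceeded {timeout_s} seconds")
    if process.returncode != 0:
        raise Exception(f"llama.cpp failed: {stderr}")
    return stdout
```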

docs/tools_user_guide.md

Lines changed: 13 additions & 1 deletion
@@ -245,4 +245,16 @@ For example:

 ```
 export TURNKEY_BUILD_MONITOR="False"
-```
+```
+
+### Adjust Build Monitor Update Frequency
+
+The build status monitor updates its display periodically to show progress. By default, it updates every 0.5 seconds, but you can adjust the update frequency by setting the `TURNKEY_BUILD_MONITOR_FREQUENCY` environment variable to the desired number of seconds between updates.
+
+For example:
+
+```
+export TURNKEY_BUILD_MONITOR_FREQUENCY="10.0"
+```
+
+This can be useful in long runs where frequent terminal updates might cause excessive terminal output.
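
For readers wiring a similar knob into their own tools, a minimal sketch of how such a variable could be consumed follows; the helpers below are hypothetical and not turnkeyml's actual monitor implementation:

```python
# Hypothetical sketch of honoring TURNKEY_BUILD_MONITOR_FREQUENCY;
# the real turnkeyml monitor may be structured differently.
import os
import time

def monitor_frequency(default_s: float = 0.5) -> float:
    # Fall back to the 0.5 s default if the variable is unset or malformed
    raw = os.environ.get("TURNKEY_BUILD_MONITOR_FREQUENCY", str(default_s))
    try:
        return max(float(raw), 0.0)
    except ValueError:
        return default_s

def run_monitor(get_status, stop_event):
    interval = monitor_frequency()
    while not stop_event.is_set():
        print(get_status())   # redraw the status display
        time.sleep(interval)  # throttle terminal updates
```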
Lines changed: 3 additions & 3 deletions
@@ -6,7 +6,7 @@
    "source": [
     "# LLMs on RyzenAI with TurnkeyML\n",
     "\n",
-    "This notebook will demonstrate how to bring up an example application that uses a RyzenAI to perform LLM inference. We will use the `turnkeyml.llm` APIs in order to make this as quick as possible. This notebook makes use of both the `RyzenAI NPU`, as well as the `RyzenAI Radeon integrated GPU (iGPU)`."
+    "This notebook will demonstrate how to bring up an example application that uses a RyzenAI to perform LLM inference. We will use the `lemonade` APIs in order to make this as quick as possible. This notebook makes use of both the `RyzenAI NPU`, as well as the `RyzenAI Radeon integrated GPU (iGPU)`."
    ]
   },
   {
@@ -84,7 +84,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Import the turnkey APIs\n",
+    "# Import the lemonade APIs\n",
     "from lemonade import leap\n",
     "\n",
     "# Load the model on to RyzenAI NPU\n",
@@ -121,7 +121,7 @@
     "\n",
     "### Prequisites for iGPU\n",
     "\n",
-    "- `turnkeyml[llm-oga-dml]` is installed into an activated conda environment.\n",
+    "- `turnkeyml[oga-dml]` is installed into an activated conda environment.\n",
     "- Download a copy of `Phi-3-mini`\n",
     "- See https://github.com/onnx/turnkeyml/tree/main/src/turnkeyml/llm/README.md#install-onnxruntime-genai for details"
    ]
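
For readers who have not used the `leap` API referenced in this notebook cell, calls of roughly this shape appear in lemonade's getting-started examples; the checkpoint and recipe strings below are assumptions for illustration, so consult `docs/lemonade_getting_started.md` in this commit for the exact recipes supported by your install:

```python
# Hedged sketch of the leap loading pattern referenced above. The checkpoint
# and recipe names are illustrative assumptions, not guaranteed to match
# this commit's supported options.
from lemonade import leap

# Load a model through a lemonade "recipe" (e.g., an OGA / DirectML backend on iGPU)
model, tokenizer = leap.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",  # assumed checkpoint name
    recipe="oga-dml",                    # assumed recipe string
)

input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)
print(tokenizer.decode(response[0]))
```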

setup.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@
         "datasets",
         # Install human-eval from a forked repo with Windows support until the
         # PR (https://github.com/openai/human-eval/pull/53) is merged
-        "human-eval @ git+https://github.com/ramkrishna2910/human-eval.git",
+        "human-eval-windows==1.0.4",
         "fastapi",
         "uvicorn[standard]",
     ],

src/lemonade/cli.py

Lines changed: 1 addition & 1 deletion
@@ -101,7 +101,7 @@ def main():
         parser.error(
             "The first tool in the sequence needs to be one "
             "of the 'tools that can start a sequence.' Use "
-            "`turnkey-llm -h` to see that list of tools."
+            "`lemonade -h` to see that list of tools."
         )
     # Run the evaluation tools as a build
     sequence = Sequence(tools=tool_instances)

src/lemonade/tools/llamacpp.py

Lines changed: 31 additions & 19 deletions
@@ -26,6 +26,7 @@ def generate(
         temperature: float = 0.8,
         top_p: float = 0.95,
         top_k: int = 40,
+        return_raw: bool = False,
         **kwargs,  # pylint: disable=unused-argument
     ):
         """
@@ -40,10 +41,12 @@ def generate(
             temperature: Temperature for sampling (0.0 = greedy)
             top_p: Top-p sampling threshold
             top_k: Top-k sampling threshold
+            return_raw: If True, returns the complete raw output including timing info
             **kwargs: Additional arguments (ignored)

         Returns:
-            List containing a single string with the generated text
+            List containing a single string with the generated text, or raw output if
+            return_raw=True
         """

         prompt = input_ids
@@ -68,6 +71,7 @@ def generate(
             "--top-k",
             str(top_k),
             "-e",
+            "-no-cnv",
         ]

         cmd = [str(m) for m in cmd]
@@ -82,7 +86,7 @@ def generate(
                 errors="replace",
             )

-            raw_output, stderr = process.communicate()
+            raw_output, stderr = process.communicate(timeout=600)
             if process.returncode != 0:
                 error_msg = f"llama.cpp failed with return code {process.returncode}.\n"
                 error_msg += f"Command: {' '.join(cmd)}\n"
@@ -107,28 +111,36 @@ def generate(
                     time_to_first_token_ms = float(parts.split("ms")[0].strip())
             self.time_to_first_token = time_to_first_token_ms / 1000

+            if return_raw:
+                return [raw_output, stderr]
+
+            # Find where the prompt ends and the generated text begins
+            prompt_found = False
+            output_text = ""
+            prompt_first_line = prompt.split("\n")[0]
+            for line in raw_output.splitlines():
+                if prompt_first_line in line:
+                    prompt_found = True
+                if prompt_found:
+                    line = line.replace("</s> [end of text]", "")
+                    output_text = output_text + line
+
+            if not prompt_found:
+                raise Exception(
+                    f"Could not find prompt '{prompt_first_line}' in llama.cpp output. "
+                    "This usually means the model failed to process the prompt correctly.\n"
+                    f"Raw output:\n{raw_output}\n"
+                    f"Stderr:\n{stderr}"
+                )
+
+            # Return list containing the generated text
+            return [output_text]
+
         except Exception as e:
             error_msg = f"Failed to run llama.cpp command: {str(e)}\n"
             error_msg += f"Command: {' '.join(cmd)}"
             raise Exception(error_msg)

-        # Find where the prompt ends and the generated text begins
-        prompt_found = False
-        output_text = ""
-        prompt_first_line = prompt.split("\n")[0]
-        for line in raw_output.splitlines():
-            if prompt_first_line in line:
-                prompt_found = True
-            if prompt_found:
-                line = line.replace("</s> [end of text]", "")
-                output_text = output_text + line
-
-        if not prompt_found:
-            raise Exception("Prompt not found in result, this is a bug in lemonade.")
-
-        # Return list containing the generated text
-        return [output_text]
-

 class LoadLlamaCpp(FirstTool):
     unique_name = "load-llama-cpp"
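
The prompt-stripping logic added in this diff can be read in isolation as follows; this is a simplified, self-contained restatement of the code above, and the sample output string is fabricated for illustration:

```python
# Simplified restatement of the prompt-stripping logic shown in the diff above;
# the sample llama.cpp transcript below is made up for illustration.
def strip_prompt_echo(raw_output: str, prompt: str) -> str:
    prompt_found = False
    output_text = ""
    prompt_first_line = prompt.split("\n")[0]
    for line in raw_output.splitlines():
        if prompt_first_line in line:
            prompt_found = True      # everything from the echoed prompt onward is kept
        if prompt_found:
            line = line.replace("</s> [end of text]", "")
            output_text = output_text + line
    if not prompt_found:
        raise Exception("Prompt not found in llama.cpp output")
    return output_text

# Example with a fabricated llama.cpp-style transcript:
raw = "system info...\nHello world, this is\na generated continuation</s> [end of text]"
print(strip_prompt_echo(raw, "Hello world, this is"))
# -> "Hello world, this isa generated continuation"
```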

src/lemonade/tools/llamacpp_bench.py

Lines changed: 36 additions & 77 deletions
@@ -1,6 +1,5 @@
 import argparse
 import os
-import subprocess
 import statistics
 import tqdm
 from turnkeyml.state import State
@@ -137,91 +136,51 @@ def run(
         for iteration in tqdm.tqdm(
             range(iterations), desc="iterations", disable=iterations < 2
         ):
-            cmd = [
-                state.model.executable,
-                "-m",
-                state.model.model,
-                "--ctx-size",
-                str(context_size),
-                "-n",
-                str(output_tokens),
-                "-t",
-                str(state.model.threads),
-                "-p",
-                prompt,
-                "-e",
-            ]
-
-            cmd = [str(m) for m in cmd]
-
             try:
-                process = subprocess.Popen(
-                    cmd,
-                    stdout=subprocess.PIPE,
-                    stderr=subprocess.PIPE,
-                    universal_newlines=True,
-                    encoding="utf-8",
-                    errors="replace",
-                )
-
-                raw_output, stderr = process.communicate()
-                if process.returncode != 0:
+                # Use the adapter's generate method which already has the timeout and error handling
+                raw_output, stderr = state.model.generate(prompt, return_raw=True)
+
+                # Parse the timing information from the output
+                ms_per_token = None
+                time_to_first_token_ms = None
+
+                # Look for timing in both stdout and stderr
+                for output in [raw_output, stderr]:
+                    for line in output.splitlines():
+                        if "llama_perf_context_print: eval time =" in line:
+                            parts = line.split("(")[1].strip()
+                            parts = parts.split(",")
+                            ms_per_token = float(
+                                parts[0].split("ms per token")[0].strip()
+                            )
+                        if "llama_perf_context_print: prompt eval time =" in line:
+                            parts = line.split("=")[1].split("/")[0]
+                            time_to_first_token_ms = float(parts.split("ms")[0].strip())
+
+                if ms_per_token is None or time_to_first_token_ms is None:
                     error_msg = (
-                        f"llama.cpp failed with return code {process.returncode}.\n"
+                        "Could not find timing information in llama.cpp output.\n"
                     )
-                    error_msg += f"Command: {' '.join(cmd)}\n"
-                    error_msg += f"Error output:\n{stderr}\n"
-                    error_msg += f"Standard output:\n{raw_output}"
+                    error_msg += "Raw output:\n" + raw_output + "\n"
+                    error_msg += "Stderr:\n" + stderr
                     raise Exception(error_msg)

-                if raw_output is None:
-                    raise Exception("No output received from llama.cpp process")
+                # When output_tokens is set to 1 for accuracy tests, ms_per_token tends to 0
+                # and causes a divide-by-zero error. Set tokens_per_second to 0 in such cases
+                # as performance data for generating a few tokens is not relevant.
+                tokens_per_second = 0
+                if output_tokens > 5 and ms_per_token > 0:
+                    tokens_per_second = 1000 / ms_per_token
+                time_to_first_token = time_to_first_token_ms / 1000

-            except Exception as e:
-                error_msg = f"Failed to run llama.cpp command: {str(e)}\n"
-                error_msg += f"Command: {' '.join(cmd)}"
-                raise Exception(error_msg)
+                if iteration > warmup_iterations - 1:
+                    iteration_tokens_per_second.append(tokens_per_second)
+                    iteration_time_to_first_token.append(time_to_first_token)

-            ms_per_token = None
-            time_to_first_token_ms = None
-            for line in raw_output.splitlines():
-                if "llama_perf_context_print: eval time =" in line:
-                    parts = line.split("(")[1].strip()
-                    parts = parts.split(",")
-                    ms_per_token = float(parts[0].split("ms per token")[0].strip())
-                if "llama_perf_context_print: prompt eval time =" in line:
-                    parts = line.split("=")[1].split("/")[0]
-                    time_to_first_token_ms = float(parts.split("ms")[0].strip())
-
-            if ms_per_token is None or time_to_first_token_ms is None:
-                # Look in stderr as well since some versions of llama.cpp output timing there
-                for line in stderr.splitlines():
-                    if "llama_perf_context_print: eval time =" in line:
-                        parts = line.split("(")[1].strip()
-                        parts = parts.split(",")
-                        ms_per_token = float(parts[0].split("ms per token")[0].strip())
-                    if "llama_perf_context_print: prompt eval time =" in line:
-                        parts = line.split("=")[1].split("/")[0]
-                        time_to_first_token_ms = float(parts.split("ms")[0].strip())
-
-            if ms_per_token is None or time_to_first_token_ms is None:
-                error_msg = "Could not find timing information in llama.cpp output.\n"
-                error_msg += "Raw output:\n" + raw_output + "\n"
-                error_msg += "Error output:\n" + stderr
+            except Exception as e:
+                error_msg = f"Failed to run benchmark: {str(e)}"
                 raise Exception(error_msg)

-            # When output_tokens is set to 1 for accuracy tests, ms_per_token tends to 0
-            # and causes a divide-by-zero error. Set tokens_per_second to 0 in such cases
-            # as performance data for generating a few tokens is not relevant.
-            tokens_per_second = 0
-            if output_tokens > 5 and ms_per_token > 0:
-                tokens_per_second = 1000 / ms_per_token
-            time_to_first_token = time_to_first_token_ms / 1000
-
-            if iteration > warmup_iterations - 1:
-                iteration_tokens_per_second.append(tokens_per_second)
-                iteration_time_to_first_token.append(time_to_first_token)
-
         token_generation_tokens_per_second = statistics.mean(
             iteration_tokens_per_second
         )
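
As a worked example of the timing parsing and the tokens-per-second math retained in this diff, with made-up numbers and a fabricated llama.cpp-style perf line:

```python
# Worked example of the parsing and tokens/sec math from the diff above;
# the perf line below is a fabricated sample in llama.cpp's output format.
sample = (
    "llama_perf_context_print: eval time =    5000.00 ms /   100 runs   "
    "(   50.00 ms per token,    20.00 tokens per second)"
)

parts = sample.split("(")[1].strip()      # "50.00 ms per token,    20.00 tokens per second)"
parts = parts.split(",")
ms_per_token = float(parts[0].split("ms per token")[0].strip())  # 50.0

tokens_per_second = 0
output_tokens = 100
if output_tokens > 5 and ms_per_token > 0:
    tokens_per_second = 1000 / ms_per_token  # 1000 ms / 50 ms per token = 20.0 tokens/s

print(ms_per_token, tokens_per_second)       # 50.0 20.0
```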
