Adds llamacpp benchmarking support #263


Merged 1 commit on Jan 10, 2025
Changes from all commits
132 changes: 105 additions & 27 deletions docs/llamacpp.md
@@ -1,48 +1,126 @@
# LLAMA.CPP

Run transformer models using llama.cpp. This integration allows you to:
1. Load and run llama.cpp models
2. Benchmark model performance
3. Use the models with other tools, such as chat or MMLU accuracy testing

## Prerequisites

You need:
1. A compiled llama.cpp executable (`llama-cli` or `llama-cli.exe`)
2. A GGUF model file

These instructions assume that lemonade has already been installed.
### Building llama.cpp (if needed)

#### Linux
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```

#### Windows
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```

The executable will be in `build/bin/Release/llama-cli.exe` on Windows or `llama-cli` in the root directory on Linux.
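
If you don't already have a GGUF checkpoint, one option is to download one from Hugging Face (see the [llama.cpp build docs](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md) for more build options). A minimal sketch using one example checkpoint; any GGUF file will do:

```bash
# Download an example GGUF checkpoint into llama.cpp's models directory
cd models
wget https://huggingface.co/TheBloke/Dolphin-Llama2-7B-GGUF/resolve/main/dolphin-llama2-7b.Q5_K_M.gguf
cd ..

# Quick smoke test of the build (Linux path shown; use build\bin\Release\llama-cli.exe on Windows)
# -m selects the model, -p the prompt, -n the number of tokens to generate
./llama-cli -m models/dolphin-llama2-7b.Q5_K_M.gguf -p "Hello" -n 16
```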

## Usage

### Loading a Model

Use the `load-llama-cpp` tool to load a model:

```bash
lemonade -i MODEL_NAME load-llama-cpp \
--executable PATH_TO_EXECUTABLE \
--model-binary PATH_TO_GGUF_FILE
```

The `load-llama-cpp` tool supports the following parameters:
| Parameter | Required | Default | Description |
|--------------|----------|---------|-------------------------------------------------------|
| executable | Yes | - | Path to llama-cli/llama-cli.exe |
| model-binary | Yes | - | Path to .gguf model file |
| threads | No | 1 | Number of threads for generation |
| context-size | No | 512 | Context window size |
| output-tokens| No | 512 | Maximum number of tokens to generate |
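
For example, a sketch of a load command that overrides the defaults (paths are placeholders, and the optional flag spellings are assumed to mirror the parameter names in the table above):

```bash
# --threads, --context-size, and --output-tokens are assumed to match the table above
lemonade -i MODEL_NAME load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE \
    --threads 8 \
    --context-size 1024 \
    --output-tokens 256
```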

### Benchmarking

After loading a model, you can benchmark it using `llama-cpp-bench`:

```bash
lemonade -i MODEL_NAME \
load-llama-cpp \
--executable PATH_TO_EXECUTABLE \
--model-binary PATH_TO_GGUF_FILE \
llama-cpp-bench
```

Benchmark parameters:
| Parameter | Default | Description |
|------------------|----------------------------|-------------------------------------------|
| prompt | "Hello, I am conscious and"| Input prompt for benchmarking |
| context-size | 512 | Context window size |
| output-tokens | 512 | Number of tokens to generate |
| iterations | 1 | Number of benchmark iterations |
| warmup-iterations| 0 | Number of warmup iterations (not counted) |

The benchmark will measure and report:
- Time to first token (prompt evaluation time)
- Token generation speed (tokens per second)
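
For instance, a sketch of a longer benchmark run with a custom prompt; `--iterations` and `--warmup-iterations` appear in the examples below, while the `--prompt` and `--output-tokens` spellings are assumed to follow the parameter names in the table above:

```bash
# --prompt and --output-tokens are assumed to mirror the benchmark parameter table
lemonade -i MODEL_NAME \
    load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE \
    llama-cpp-bench \
    --prompt "Once upon a time" \
    --output-tokens 256 \
    --iterations 5 \
    --warmup-iterations 1
```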

### Example Commands

#### Windows Example
```bash
# Load and benchmark a model
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
load-llama-cpp \
--executable "C:\work\llama.cpp\build\bin\Release\llama-cli.exe" \
--model-binary "C:\work\llama.cpp\models\qwen2.5-0.5b-instruct-fp16.gguf" \
llama-cpp-bench \
--iterations 3 \
--warmup-iterations 1

# Run MMLU accuracy test
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
load-llama-cpp \
--executable "C:\work\llama.cpp\build\bin\Release\llama-cli.exe" \
--model-binary "C:\work\llama.cpp\models\qwen2.5-0.5b-instruct-fp16.gguf" \
accuracy-mmlu \
--tests management \
--max-evals 2
```

#### Linux Example
```bash
# Load and benchmark a model
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
load-llama-cpp \
--executable "./llama-cli" \
--model-binary "./models/qwen2.5-0.5b-instruct-fp16.gguf" \
llama-cpp-bench \
--iterations 3 \
--warmup-iterations 1
```

## Integration with Other Tools

After loading with `load-llama-cpp`, the model can be used with any tool that supports the ModelAdapter interface, including:
- accuracy-mmlu
- llm-prompt
- accuracy-humaneval
- and more
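
For example, a sketch of chaining a loaded model into `llm-prompt` for a single completion (the `-p`/`--prompt` flag is an assumption here; check the tool's help output for the exact option names):

```bash
# -p is assumed; consult llm-prompt's help output for the actual flag spelling
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    load-llama-cpp \
    --executable "./llama-cli" \
    --model-binary "./models/qwen2.5-0.5b-instruct-fp16.gguf" \
    llm-prompt \
    -p "Hello, my name is"
```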

The integration provides:
- Platform-independent path handling (works on both Windows and Linux)
- Proper error handling with detailed messages
- Performance metrics collection
- Configurable generation parameters (temperature, top_p, top_k)
3 changes: 2 additions & 1 deletion src/lemonade/cli.py
@@ -14,7 +14,7 @@

from lemonade.tools.huggingface_bench import HuggingfaceBench
from lemonade.tools.ort_genai.oga_bench import OgaBench

from lemonade.tools.llamacpp_bench import LlamaCppBench
from lemonade.tools.llamacpp import LoadLlamaCpp

import lemonade.cache as cache
@@ -30,6 +30,7 @@ def main():
tools = [
HuggingfaceLoad,
LoadLlamaCpp,
LlamaCppBench,
AccuracyMMLU,
AccuracyHumaneval,
AccuracyPerplexity,