Adds llamacpp benchmarking support #263


Merged 1 commit on Jan 10, 2025
Changes from all commits
132 changes: 105 additions & 27 deletions docs/llamacpp.md
@@ -1,48 +1,126 @@
# LLAMA.CPP

Run transformer models using llama.cpp. This integration allows you to:
1. Load and run llama.cpp models
2. Benchmark model performance
3. Use the models with other tools, such as chat or MMLU accuracy testing

## Prerequisites

You need:
1. A compiled llama.cpp executable (`llama-cli` or `llama-cli.exe`)
2. A GGUF model file

These instructions assume that lemonade has already been installed.
### Building llama.cpp (if needed)

#### Linux
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```

#### Windows
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
```

The executable will be in `build/bin/Release/llama-cli.exe` on Windows or `llama-cli` in the root directory on Linux.
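
If you don't already have a GGUF checkpoint, one option is to download one from Hugging Face (see the [llama.cpp build docs](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md) for more build options). A minimal sketch using one example checkpoint; any GGUF file will do:

```bash
# Download an example GGUF checkpoint into llama.cpp's models directory
cd models
wget https://huggingface.co/TheBloke/Dolphin-Llama2-7B-GGUF/resolve/main/dolphin-llama2-7b.Q5_K_M.gguf
cd ..

# Quick smoke test of the build (Linux path shown; use build\bin\Release\llama-cli.exe on Windows)
# -m selects the model, -p the prompt, -n the number of tokens to generate
./llama-cli -m models/dolphin-llama2-7b.Q5_K_M.gguf -p "Hello" -n 16
```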

## Usage

### Loading a Model

Use the `load-llama-cpp` tool to load a model:

```bash
lemonade -i MODEL_NAME load-llama-cpp \
--executable PATH_TO_EXECUTABLE \
--model-binary PATH_TO_GGUF_FILE
```

The `load-llama-cpp` tool supports the following parameters:
| Parameter | Required | Default | Description |
|--------------|----------|---------|-------------------------------------------------------|
| executable | Yes | - | Path to llama-cli/llama-cli.exe |
| model-binary | Yes | - | Path to .gguf model file |
| threads | No | 1 | Number of threads for generation |
| context-size | No | 512 | Context window size |
| output-tokens| No | 512 | Maximum number of tokens to generate |
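
For example, a sketch of a load command that overrides the defaults (paths are placeholders, and the optional flag spellings are assumed to mirror the parameter names in the table above):

```bash
# --threads, --context-size, and --output-tokens are assumed to match the table above
lemonade -i MODEL_NAME load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE \
    --threads 8 \
    --context-size 1024 \
    --output-tokens 256
```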

### Benchmarking

After loading a model, you can benchmark it using `llama-cpp-bench`:

```bash
lemonade -i MODEL_NAME \
load-llama-cpp \
--executable PATH_TO_EXECUTABLE \
--model-binary PATH_TO_GGUF_FILE \
llama-cpp-bench
```

Benchmark parameters:
| Parameter | Default | Description |
|------------------|----------------------------|-------------------------------------------|
| prompt | "Hello, I am conscious and"| Input prompt for benchmarking |
| context-size | 512 | Context window size |
| output-tokens | 512 | Number of tokens to generate |
| iterations | 1 | Number of benchmark iterations |
| warmup-iterations| 0 | Number of warmup iterations (not counted) |

The benchmark will measure and report:
- Time to first token (prompt evaluation time)
- Token generation speed (tokens per second)
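
For instance, a sketch of a longer benchmark run with a custom prompt; `--iterations` and `--warmup-iterations` appear in the examples below, while the `--prompt` and `--output-tokens` spellings are assumed to follow the parameter names in the table above:

```bash
# --prompt and --output-tokens are assumed to mirror the benchmark parameter table
lemonade -i MODEL_NAME \
    load-llama-cpp \
    --executable PATH_TO_EXECUTABLE \
    --model-binary PATH_TO_GGUF_FILE \
    llama-cpp-bench \
    --prompt "Once upon a time" \
    --output-tokens 256 \
    --iterations 5 \
    --warmup-iterations 1
```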

### Example Commands

#### Windows Example
```bash
# Load and benchmark a model
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
load-llama-cpp \
--executable "C:\work\llama.cpp\build\bin\Release\llama-cli.exe" \
--model-binary "C:\work\llama.cpp\models\qwen2.5-0.5b-instruct-fp16.gguf" \
llama-cpp-bench \
--iterations 3 \
--warmup-iterations 1

# Run MMLU accuracy test
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
load-llama-cpp \
--executable "C:\work\llama.cpp\build\bin\Release\llama-cli.exe" \
--model-binary "C:\work\llama.cpp\models\qwen2.5-0.5b-instruct-fp16.gguf" \
accuracy-mmlu \
--tests management \
--max-evals 2
```

#### Linux Example
```bash
# Load and benchmark a model
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
load-llama-cpp \
--executable "./llama-cli" \
--model-binary "./models/qwen2.5-0.5b-instruct-fp16.gguf" \
llama-cpp-bench \
--iterations 3 \
--warmup-iterations 1
```

## Integration with Other Tools

After loading with `load-llama-cpp`, the model can be used with any tool that supports the ModelAdapter interface, including:
- accuracy-mmlu
- llm-prompt
- accuracy-humaneval
- and more
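
For example, a sketch of chaining a loaded model into `llm-prompt` for a single completion (the `-p`/`--prompt` flag is an assumption here; check the tool's help output for the exact option names):

```bash
# -p is assumed; consult llm-prompt's help output for the actual flag spelling
lemonade -i Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    load-llama-cpp \
    --executable "./llama-cli" \
    --model-binary "./models/qwen2.5-0.5b-instruct-fp16.gguf" \
    llm-prompt \
    -p "Hello, my name is"
```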

The integration provides:
- Platform-independent path handling (works on both Windows and Linux)
- Proper error handling with detailed messages
- Performance metrics collection
- Configurable generation parameters (temperature, top_p, top_k)
3 changes: 2 additions & 1 deletion src/lemonade/cli.py
@@ -14,7 +14,7 @@

from lemonade.tools.huggingface_bench import HuggingfaceBench
from lemonade.tools.ort_genai.oga_bench import OgaBench

from lemonade.tools.llamacpp_bench import LlamaCppBench
from lemonade.tools.llamacpp import LoadLlamaCpp

import lemonade.cache as cache
@@ -30,6 +30,7 @@ def main():
tools = [
HuggingfaceLoad,
LoadLlamaCpp,
LlamaCppBench,
AccuracyMMLU,
AccuracyHumaneval,
AccuracyPerplexity,