Move llm source code into src/lemonade dir. Add HumanEval. (#262)
* Move lemonade source code into its own module
* Reorganize into src/lemonade. Add HumanEval.

Signed-off-by: Jeremy Fowers <[email protected]>
Co-authored-by: Ramakrishnan Sivakumar <[email protected]>
1 parent 5b2d152 · commit ed30c98
Showing 30 changed files with 630 additions and 121 deletions.
@@ -0,0 +1,108 @@
# Using the HumanEval accuracy test tools

The HumanEval benchmark is a code generation and functional correctness evaluation framework designed to assess language models' ability to generate Python code. It consists of 164 handwritten programming problems, each containing a function signature, docstring, body, and several unit tests. The benchmark focuses on whether a model can generate functionally correct code that passes the test cases, making it particularly useful for assessing code generation capabilities.

This tool provides an automated way to evaluate language models on the HumanEval benchmark. It handles downloading the dataset, generating code completions, executing them in a secure environment, and calculating pass@k metrics.

## Dataset

The HumanEval dataset is automatically downloaded from [OpenAI's human-eval repository](https://github.com/openai/human-eval) when you first run the benchmark. The dataset contains programming problems (each one a small JSON record; see the sketch after the list) that test various aspects of Python programming, including:

- Basic programming operations
- String manipulation
- Mathematical computations
- List operations
- Algorithm implementation
- Data structure manipulation
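
If you want to inspect the problems yourself, the `human-eval` package that this tool builds on provides a loader. The sketch below is illustrative; the field names shown are the ones used in OpenAI's dataset:

```python
# Minimal sketch: inspect HumanEval problems with OpenAI's human-eval package
from human_eval.data import read_problems

problems = read_problems()  # dict keyed by task_id; accepts a path to HumanEval.jsonl.gz

task_id, problem = next(iter(problems.items()))
print(task_id)                        # e.g., "HumanEval/0"
print(problem["prompt"])              # function signature + docstring shown to the model
print(problem["entry_point"])         # name of the function under test
print(problem["canonical_solution"])  # reference implementation
print(problem["test"][:200])          # unit tests used to judge completions
```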

## Running the Benchmark

```bash
lemonade -i meta-llama/Llama-3.2-1B oga-load --device igpu --dtype int4 accuracy-humaneval --k-samples 1 --first-n-samples 5 --timeout 30.0
```

### Optional arguments

`--k-samples`: Number of completions to generate per prompt (default: 1). This parameter determines the k in pass@k metrics. For example:
- `--k-samples 1`: Calculates pass@1 (single attempt per problem)
- `--k-samples 10`: Calculates pass@10 (ten attempts per problem)
- `--k-samples 100`: Calculates pass@100 (one hundred attempts per problem)

Higher k values provide a more robust evaluation but take longer to run.

`--first-n-samples`: Evaluate only the first N problems from the dataset (default: entire dataset). Useful for quick testing or when you want to evaluate a subset of problems.

`--timeout`: Maximum time in seconds allowed for each test case execution (default: 30.0). This prevents infinite loops or long-running code from blocking the evaluation.

`--data-dir`: Custom directory for storing the HumanEval dataset (default: `<lemonade_cache_dir>/data/humaneval`).

## How It Works

1. **Dataset Preparation:**
   - On first run, the tool downloads the HumanEval dataset (HumanEval.jsonl.gz)
   - The dataset contains function signatures, docstrings, and test cases
   - Each problem is structured to test specific programming capabilities
   - You can evaluate only the first N problems using `--first-n-samples`

2. **Code Generation:**
   - For each programming problem, the model is provided with a prompt containing:
     - Function signature (e.g., `def sort_numbers(numbers):`)
     - Docstring describing the function's purpose and requirements
   - The model generates k code completions for the function body (controlled by `--k-samples`); a sketch of this flow follows below
   - These k samples are used to calculate the pass@k metric
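
The exact generation code lives inside lemonade's HumanEval tool, but conceptually the step looks like the sketch below. This is an illustrative Hugging Face `transformers` flow, not lemonade's implementation; the checkpoint name, generation settings, and truncation heuristic are assumptions:

```python
# Illustrative sketch of the code-generation step (not lemonade's exact code)
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import read_problems, write_jsonl

checkpoint = "meta-llama/Llama-3.2-1B"  # assumption: any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

k_samples = 1  # mirrors --k-samples
samples = []
for task_id, problem in read_problems().items():
    inputs = tokenizer(problem["prompt"], return_tensors="pt")
    for _ in range(k_samples):
        output = model.generate(**inputs, max_new_tokens=256, do_sample=k_samples > 1)
        text = tokenizer.decode(output[0], skip_special_tokens=True)
        # Heuristic: everything after the prompt is treated as the function body
        samples.append({"task_id": task_id, "completion": text[len(problem["prompt"]):]})

write_jsonl("humaneval_predictions.jsonl", samples)
```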

3. **Secure Execution:**
   - Generated code is executed in a secure sandbox environment maintained by OpenAI's human-eval library (see the sketch after this list). Note that OpenAI disables code execution by default; lemonade enables it by automatically setting the environment variable `HF_ALLOW_CODE_EVAL=1`. The library provides the following code execution protections:
     - **Process Isolation**: Each code sample runs in a separate process to prevent interference
     - **Resource Limits**:
       - CPU time limit (controlled by `--timeout`)
       - Memory usage restrictions
       - Maximum output size restrictions
     - **Restricted Access**:
       - No network access
       - No file system access outside the test directory
       - No subprocess creation
       - No system calls
     - **Module Restrictions**:
       - Only allows importing standard Python libraries needed for testing
       - Blocks potentially dangerous modules (os, sys, subprocess, etc.)
   - These security measures are implemented through:
     - Python's built-in `resource` module for resource limits
     - AST (Abstract Syntax Tree) analysis for code validation
     - Process-level isolation using `multiprocessing`
     - Custom import hooks to restrict module access
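
Under the hood, the sandboxed execution comes from the human-eval library itself. A minimal sketch of checking a single completion against its unit tests (the completion string here is a made-up placeholder):

```python
# Minimal sketch: run one completion through human-eval's sandboxed executor
from human_eval.data import read_problems
from human_eval.execution import check_correctness

problems = read_problems()
completion = "    return sorted(numbers)\n"  # hypothetical model output

result = check_correctness(problems["HumanEval/0"], completion, timeout=30.0)
print(result["passed"])  # True only if every unit test passed within the timeout
print(result["result"])  # e.g., "passed", "timed out", or an error message
```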

4. **Evaluation Metrics:**
   - **pass@k**: Percentage of problems solved within k attempts (see the estimator sketch after this list)
     - pass@1: Success rate with a single attempt
     - pass@10: Success rate within 10 attempts
     - pass@100: Success rate within 100 attempts
   - A problem is considered solved if all of its test cases pass
   - Results are normalized to percentages
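
pass@k is computed with the unbiased estimator from the Codex paper rather than a raw ratio: pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that passed. When exactly k samples are generated (as with `--k-samples`), this reduces to whether at least one sample passed. A small sketch of the calculation:

```python
# Unbiased pass@k estimator (Chen et al., 2021), computed in a stable product form
def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples per problem, c = samples that passed, k = evaluation budget."""
    if n - c < k:
        return 1.0  # too few failures to draw k samples without hitting a success
    estimate = 1.0
    for i in range(n - c + 1, n + 1):
        estimate *= 1.0 - k / i
    return 1.0 - estimate

print(pass_at_k(10, 3, 1))   # ~0.3 -> averaged over problems, this gives pass@1
print(pass_at_k(10, 3, 10))  # 1.0  -> with k == n, any passing sample counts
```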

5. **Output Files:**
   The tool generates several output files in the results directory (a sketch of reading them follows below):
   - `evaluation_results.csv`: Contains prompts, completions, and expected answers
   - `humaneval_predictions.jsonl`: Raw model predictions in JSONL format
   - `humaneval_predictions.jsonl_results.jsonl`: Detailed evaluation results
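
The two `.jsonl` files are plain JSON-lines, so they are easy to inspect or post-process. A small sketch of tallying the detailed results file (the `passed` field name follows human-eval's output format and is an assumption here):

```python
# Sketch: tally per-sample outcomes from the detailed results file
import json

passed = total = 0
with open("humaneval_predictions.jsonl_results.jsonl", encoding="utf-8") as fp:
    for line in fp:
        record = json.loads(line)
        total += 1
        passed += bool(record.get("passed"))  # assumed field written by human-eval

print(f"{passed}/{total} samples passed their unit tests")
```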

## Example Results Format

The evaluation produces metrics in the following format:
```json
{
  "pass@1": 0.25,   // 25% success rate with 1 attempt
  "pass@10": 0.45,  // 45% success rate within 10 attempts
  "pass@100": 0.65  // 65% success rate within 100 attempts
}
```

## Limitations

1. **Resource Requirements**: Generating multiple samples per problem (high k values) can be computationally intensive and time-consuming.
2. **Memory Usage**: Large language models may require significant memory, especially when generating multiple samples.

## References

1. [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374)
2. [OpenAI HumanEval Repository](https://github.com/openai/human-eval)
@@ -0,0 +1,18 @@
"""
This example demonstrates how to use the LEAP API to load a model for
inference on CPU using the hf-cpu recipe, and then use it to generate
the response to a prompt.
If you have a discrete GPU, you can try that by changing the recipe
to hf-dgpu. Note: make sure to have torch+cuda installed when trying
hf-dgpu.
"""

from lemonade import leap

model, tokenizer = leap.from_pretrained("facebook/opt-125m", recipe="hf-cpu")

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)

print(tokenizer.decode(response[0]))
@@ -0,0 +1,21 @@
"""
This example demonstrates how to use the LEAP API to load a model for
inference on a Ryzen AI NPU using the ryzenai-npu-load recipe,
and then use it to generate the response to a prompt.
Note that this example will only run if the Ryzen AI NPU Private recipe is installed.
See genai/docs/ryzenai_npu.md for instructions.
You can try the same model on CPU by changing the recipe to "hf-cpu".
"""

from lemonade import leap

model, tokenizer = leap.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", recipe="ryzenai-npu"
)

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)

print(tokenizer.decode(response[0]))
@@ -0,0 +1,38 @@
"""
This example demonstrates how to use the LEAP API to load a model for
inference on CPU using the hf-cpu recipe, and then use a thread to
stream the response to a prompt.
Note: this approach only works with recipes that support TextIteratorStreamer,
i.e., huggingface-based recipes such as hf-cpu and ryzenai-npu.
"""

from threading import Thread
from transformers import TextIteratorStreamer
from lemonade import leap

# Replace the recipe with "ryzenai-npu" to run on the RyzenAI NPU
model, tokenizer = leap.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", recipe="hf-cpu"
)

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids

streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,
)
generation_kwargs = {
    "input_ids": input_ids,
    "streamer": streamer,
    "max_new_tokens": 30,
}

thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

# Print the response as it streams in
for new_text in streamer:
    print(new_text)

thread.join()
@@ -3,11 +3,11 @@
with open("src/turnkeyml/version.py", encoding="utf-8") as fp:
    version = fp.read().split('"')[1]


setup(
    name="turnkeyml",
    version=version,
    description="TurnkeyML Tools and Models",
    author="Jeremy Fowers, Daniel Holanda, Ramakrishnan Sivakumar, Victoria Godsoe",
    author_email="[email protected]",
    package_dir={"": "src", "turnkeyml_models": "models"},
    packages=[
@@ -17,10 +17,10 @@
        "turnkeyml.sequence",
        "turnkeyml.cli",
        "turnkeyml.common",
        "turnkeyml.llm",
        "turnkeyml.llm.tools",
        "turnkeyml.llm.tools.ort_genai",
        "turnkeyml.llm.tools.ryzenai_npu",
        "lemonade",
        "lemonade.tools",
        "lemonade.tools.ort_genai",
        "lemonade.tools.ryzenai_npu",
        "turnkeyml_models",
        "turnkeyml_models.graph_convolutions",
        "turnkeyml_models.selftest",
@@ -46,77 +46,61 @@
        "psutil",
        "wmi",
        "pytz",
        "tqdm",
        # Conditional dependencies for ONNXRuntime backends
        "onnxruntime >=1.10.1;platform_system=='Linux' and extra != 'llm-oga-cuda'",
        "onnxruntime-directml >=1.19.0;platform_system=='Windows' and extra != 'llm-oga-cuda'",
        "onnxruntime-gpu >=1.19.1;extra == 'llm-oga-cuda'",
    ],
    extras_require={
        "llm": [
            "tqdm",
            "torch>=2.0.0",
            "transformers",
            "accelerate",
            "py-cpuinfo",
            "sentencepiece",
            "datasets",
            # Install human-eval from a forked repo with Windows support until the
            # PR (https://github.com/openai/human-eval/pull/53) is merged
            "human-eval @ git+https://github.com/ramkrishna2910/human-eval.git",
            "fastapi",
            "uvicorn[standard]",
        ],
        "llm-oga-dml": [
        "llm-oga-igpu": [
            "onnxruntime-genai-directml==0.4.0",
            "tqdm",
            "torch>=2.0.0,<2.4",
            "transformers<4.45.0",
            "accelerate",
            "py-cpuinfo",
            "sentencepiece",
            "datasets",
            "fastapi",
            "uvicorn[standard]",
            "turnkeyml[llm]",
        ],
        "llm-oga-cuda": [
            "onnxruntime-genai-cuda==0.4.0",
            "tqdm",
            "torch>=2.0.0,<2.4",
            "transformers<4.45.0",
            "accelerate",
            "py-cpuinfo",
            "sentencepiece",
            "datasets",
            "fastapi",
            "uvicorn[standard]",
            "turnkeyml[llm]",
        ],
        "llm-oga-npu": [
            "transformers",
            "torch",
            "onnx==1.16.0",
            "onnxruntime==1.18.0",
            "numpy==1.26.4",
            "tqdm",
            "accelerate",
            "py-cpuinfo",
            "sentencepiece",
            "datasets",
            "fastapi",
            "uvicorn[standard]",
            "turnkeyml[llm]",
        ],
        "llm-oga-hybrid": [
            "transformers",
            "torch",
            "onnx==1.16.1",
            "numpy==1.26.4",
            "datasets",
            "fastapi",
            "uvicorn[standard]",
            "turnkeyml[llm]",
        ],
        "cuda": [
            "torch @ https://download.pytorch.org/whl/cu118/torch-2.3.1%2Bcu118-cp310-cp310-win_amd64.whl",
            "torchvision @ https://download.pytorch.org/whl/cu118/torchvision-0.18.1%2Bcu118-cp310-cp310-win_amd64.whl",
            "torchaudio @ https://download.pytorch.org/whl/cu118/torchaudio-2.3.1%2Bcu118-cp310-cp310-win_amd64.whl",
        ],
    },
    classifiers=[],
    entry_points={
        "console_scripts": [
            "turnkey=turnkeyml:turnkeycli",
            "turnkey-llm=turnkeyml.llm:lemonadecli",
            "lemonade=turnkeyml.llm:lemonadecli",
            "turnkey-llm=lemonade:lemonadecli",
            "lemonade=lemonade:lemonadecli",
        ]
    },
    python_requires=">=3.8, <3.12",
File renamed without changes.
File renamed without changes.