Commit e4ef7cf

Docs for eval scripts

1 parent fb61a81 commit e4ef7cf

2 files changed: +87 -23 lines

README.md

Lines changed: 2 additions & 23 deletions
@@ -3,7 +3,7 @@
 ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs.
 
 
-## New in v0.1.0:
+## New in v0.1.0+:
 
 - ExLlamaV2 now supports paged attention via [Flash Attention](https://github.com/Dao-AILab/flash-attention) 2.5.7+
 - New generator with dynamic batching, smart prompt caching, K/V cache deduplication and simplified API
@@ -47,9 +47,6 @@ performs better anyway, see [here](doc/qcache_eval.md).)
 See the full, updated examples [here](https://github.com/turboderp/exllamav2/tree/master/examples).
 
 
-
-
-
 ## Performance
 
 Some quick tests to compare performance with ExLlama V1. There may be more performance optimizations in the future,
@@ -190,25 +187,7 @@ script and its options are explained in [detail here](doc/convert.md)
 
 ### Evaluation
 
-A script is provided to run the MMLU benchmark. In order to run it you first need to install these packages:
-
-```
-#optional - create a python env
-python -m venv .venv
-#activate the enviroment
-source .venv/bin/activate
-
-#install datasets
-pip install datasets
-
-#install flash attention
-pip install flash-attn --no-build-isolation
-```
-
-To run the benchmark:
-```
-python eval/mmlu.py -m /path/to/model
-```
+A number of evaluation scripts are provided. See [here](doc/eval.md) for details.
 
 ### Community
 

doc/eval.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
# ExLlamaV2 Evaluation scripts

Common arguments:

- **-m / --model *directory***: _(required)_ Path to model (EXL2, GPTQ or FP16)

- **-gs / --gpu_split *list***: List of memory allocations per GPU, in GB, for model weights and static buffers
(excluding cache). Example: `-gs 10.5,0,10.5` would allocate 10.5 GB on CUDA devices 0 and 2 while skipping
device 1. `-gs auto` will load the model in auto split mode, which fills available devices in order. (A combined
example follows this list.)

- **-l / --length *int***: Context length. The default is the model's native context length, which may be
excessive for most benchmarks.

- **-rs / --rope_scale *float***: RoPE scale factor (linear)

- **-ra / --rope_alpha *float***: RoPE scale factor (NTK)

- **-nfa / --no_flash_attn**: Don't use flash-attn.

- **-nxf / --no_xformers**: Don't use xformers.

- **-fst / --fast_safetensors**: Use alternative loading mode. On Linux, this mode uses direct I/O and pinned
buffers and can potentially load faster from very fast NVMe RAID arrays with a cold cache. On Windows, it
uses regular non-memory-mapped I/O and is typically just slower. In either case, however, it can fix situations
in which ExLlama runs out of system memory when loading large models.
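As a quick illustration, a run combining several of these common arguments with one of the scripts below might
look like the following; the model path, memory split and context length are placeholder values, not
recommendations:

```
# Illustrative only: run the MMLU script with weights split across two GPUs
# (10.5 GB each), a 4096-token context and flash-attn disabled.
python eval/mmlu.py -m /path/to/model -gs 10.5,10.5 -l 4096 -nfa
```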
## HumanEval
28+
29+
This is the standard [HumanEval](https://github.com/openai/human-eval) test implemented for ExLlamaV2 with
30+
dynamic batching.
31+
32+
```
33+
pip install human-eval
34+
python eval/humaneval.py -m <model_dir> -o humaneval_output.json
35+
evaluate-functional-correctness humaneval_output.json
36+
```
37+
38+
Arguments:
39+
40+
- **-o / --output *file***: _(required)_ Output JSON file to receive generated samples
41+
42+
- **-spt / --samples_per_task *int***: Number of samples for each HumanEval task. A single sample per task is
43+
sufficient to compute an approximate pass@1 score, but more samples give a more accurate score. At least 10
44+
samples is required for a pass@10 score, etc.
45+
46+
- **--max_tokens *int***: Maximum number of tokens to generate before cutting a sample short. The stop condition
47+
for each generation, if this limit isn't reached first, is the first newline character not followed by
48+
whitespace, i.e. the first non-indented line after the function definition has been generated. Default is 768
49+
which seems sufficient for most HumanEval tasks.
50+
51+
- **-pf / --prompt *str***: By default, the sample is a raw completion suitable for both base models and instruct
52+
tuned models Supplying a prompt format turns each task into an instruct prompt asking for the completion with a
53+
prefix for the response.
54+
55+
- **-v / --verbose**: Output completions as they're being generated (otherwise show a progress bar.)
56+
57+
- **-cs / --cache_size *int***: Total number of cache tokens. Set this as high as possible for best batching
58+
performance.
59+
60+
- **-cq4 / --cache_q4**: Use Q4 cache
61+
62+
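As a sketch of a fuller run, the following collects ten samples per task (enough for a pass@10 estimate) with a
quantized cache; the model path and cache size are illustrative, not recommendations:

```
# Illustrative: 10 samples per task for pass@10, a 16384-token Q4 cache for
# better batching, and verbose output while generating.
python eval/humaneval.py -m /path/to/model -o humaneval_output.json -spt 10 -cs 16384 -cq4 -v

# Then score the generated samples with the human-eval package:
evaluate-functional-correctness humaneval_output.json
```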
## MMLU

This is the standard [MMLU](https://github.com/hendrycks/test) test implemented for ExLlamaV2 with
dynamic batching.

```
pip install datasets
python eval/mmlu.py -m <model_dir>
```

Arguments:

- **-sub / --subjects *list***: Limit the test to the listed subjects; otherwise test on all subjects. E.g.
`-sub anatomy,nutrition,professional_medicine`. See [the dataset](https://huggingface.co/datasets/cais/mmlu) for
the full list of subjects.

- **-fs / --fewshot_examples *int***: Number of few-shot examples before each question. Default is 5.

- **-shf / --shuffle**: Shuffle the answer choices for each question.

- **-cs / --cache_size *int***: Total number of cache tokens. Set this as high as possible for best batching
performance.

- **-cq4 / --cache_q4**: Use Q4 cache
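
For example, an illustrative run restricted to a few subjects, with shuffled answer choices and a Q4 cache (the
model path is a placeholder):

```
# Illustrative: evaluate only three subjects, shuffle answer choices, and use the
# default five few-shot examples with a Q4 cache.
python eval/mmlu.py -m /path/to/model -sub anatomy,nutrition,professional_medicine -shf -cq4
```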
