# ExLlamaV2 Evaluation scripts

Common arguments:

- **-m / --model *directory***: _(required)_ Path to model (EXL2, GPTQ or FP16)

- **-gs / --gpu_split *list***: List of memory allocations per GPU, in GB for model weights and static buffers
(excluding cache). Example: `-gs 10.5,0,10.5` would allocate 10.5 GB on CUDA devices 0 and 2 while skipping
device 1. `-gs auto` will load the model in auto split mode, which fills available devices in order.

- **-l / --length *int***: Context length. The default is the model's native context length, which may be
excessive for most benchmarks.

- **-rs / --rope_scale *float***: RoPE scale factor (linear)

- **-ra / --rope_alpha *float***: RoPE scale factor (NTK)

- **-nfa / --no_flash_attn**: Don't use flash-attn.

- **-nxf / --no_xformers**: Don't use xformers.

- **-fst / --fast_safetensors**: Use alternative loading mode. On Linux, this mode uses direct I/O and pinned
buffers and can potentially load faster from very fast NVMe RAID arrays with a cold cache. On Windows, it
uses regular, non-memory-mapped I/O and is typically just slower. However, in either case this mode can fix
situations in which ExLlama runs out of system memory when loading large models.

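For example, the common options above can be combined with either of the evaluation scripts below. This is only an illustrative sketch; the model directory and context length are placeholders:

```
python eval/mmlu.py -m /models/my-model-exl2 -gs auto -l 4096
```
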
## HumanEval

This is the standard [HumanEval](https://github.com/openai/human-eval) test implemented for ExLlamaV2 with
dynamic batching.

```
pip install human-eval
python eval/humaneval.py -m <model_dir> -o humaneval_output.json
evaluate-functional-correctness humaneval_output.json
```

Arguments:

- **-o / --output *file***: _(required)_ Output JSON file to receive generated samples

- **-spt / --samples_per_task *int***: Number of samples for each HumanEval task. A single sample per task is
sufficient to compute an approximate pass@1 score, but more samples give a more accurate score. At least 10
samples are required for a pass@10 score, etc.

- **--max_tokens *int***: Maximum number of tokens to generate before cutting a sample short. The stop condition
for each generation, if this limit isn't reached first, is the first newline character not followed by
whitespace, i.e. the first non-indented line after the function definition has been generated. Default is 768,
which seems sufficient for most HumanEval tasks.

- **-pf / --prompt_format *str***: By default, the sample is a raw completion suitable for both base models and
instruct-tuned models. Supplying a prompt format turns each task into an instruct prompt asking for the
completion, with a prefix for the response.

- **-v / --verbose**: Output completions as they're being generated (otherwise show a progress bar.)

- **-cs / --cache_size *int***: Total number of cache tokens. Set this as high as possible for best batching
performance.

- **-cq4 / --cache_q4**: Use Q4 cache
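
As an illustrative sketch combining the arguments above (the model directory, cache size and sample count are placeholders), a run aimed at a pass@10 score could look like:

```
python eval/humaneval.py -m /models/my-model-exl2 -o humaneval_output.json -spt 10 -cs 8192 -cq4
evaluate-functional-correctness humaneval_output.json
```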

## MMLU

This is the standard [MMLU](https://github.com/hendrycks/test) test implemented for ExLlamaV2 with
dynamic batching.

```
pip install datasets
python eval/mmlu.py -m <model_dir>
```

Arguments:

- **-sub / --subjects *list***: Limit the test to the listed subjects; otherwise test on all subjects. E.g.
`-sub anatomy,nutrition,professional_medicine`. See [the dataset](https://huggingface.co/datasets/cais/mmlu) for
the full list of subjects.

- **-fs / --fewshot_examples *int***: Number of fewshot examples before each question. Default is 5.

- **-shf / --shuffle**: Shuffle the answer choices to each question.

- **-cs / --cache_size *int***: Total number of cache tokens. Set this as high as possible for best batching
performance.

- **-cq4 / --cache_q4**: Use Q4 cache
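
As a final illustrative sketch (the model directory and cache size are placeholders; the subjects are taken from the example above), a run limited to a few subjects with shuffled answer choices could look like:

```
python eval/mmlu.py -m /models/my-model-exl2 -sub anatomy,nutrition,professional_medicine -fs 5 -shf -cs 8192 -cq4
```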