@@ -11,7 +11,27 @@ still needs a lot of testing and tuning, and a few key features are not yet impl
- Support for a new quant format (see below)

- ## How to do stuff
+ ## Performance
+
+ Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and
+ speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
+
+ | Model     | Mode         | Size | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090    |
+ |-----------|--------------|------|-------|-----|------------|----------|------------|-------------|
+ | Llama     | GPTQ         | 7B   | 128   | no  | 143 t/s    | 173 t/s  | 175 t/s    | **195** t/s |
+ | Llama     | GPTQ         | 13B  | 128   | no  | 84 t/s     | 102 t/s  | 105 t/s    | **110** t/s |
+ | Llama     | GPTQ         | 33B  | 128   | yes | 37 t/s     | 45 t/s   | 45 t/s     | **48** t/s  |
+ | OpenLlama | GPTQ         | 3B   | 128   | yes | 194 t/s    | 226 t/s  | 295 t/s    | **321** t/s |
+ | CodeLlama | EXL2 4.0 bpw | 34B  | -     | -   | -          | -        | 42 t/s     | **48** t/s  |
+ | Llama2    | EXL2 3.0 bpw | 7B   | -     | -   | -          | -        | 195 t/s    | **224** t/s |
+ | Llama2    | EXL2 4.0 bpw | 7B   | -     | -   | -          | -        | 164 t/s    | **197** t/s |
+ | Llama2    | EXL2 5.0 bpw | 7B   | -     | -   | -          | -        | 144 t/s    | **160** t/s |
+ | Llama2    | EXL2 2.5 bpw | 70B  | -     | -   | -          | -        | 30 t/s     | **35** t/s  |
+ | TinyLlama | EXL2 3.0 bpw | 1.1B | -     | -   | -          | -        | 536 t/s    | **635** t/s |
+ | TinyLlama | EXL2 4.0 bpw | 1.1B | -     | -   | -          | -        | 509 t/s    | **590** t/s |
+
+
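As a rough illustration of how the tokens-per-second numbers above can be measured, here is a minimal timing sketch
using the Python API the way the repo's example scripts use it. The class and method names below are an assumption
about that API and may not match every version exactly.

```
# Minimal speed test: load a model, generate a fixed number of tokens, report t/s.
import time

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "<path_to_model>"        # directory with config.json and the .safetensors file
config.prepare()

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()      # default sampling settings
num_tokens = 128

start = time.time()
generator.generate_simple("Once upon a time,", settings, num_tokens)
print(f"{num_tokens / (time.time() - start):.1f} t/s")
```
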
+ ## How to
Clone the repository and install dependencies:
@@ -25,32 +45,15 @@ python test_inference -m <path_to_model> -p "Once upon a time,"
For now, a simple console chatbot is included. Run it with:

- `python examples/chat.py -m <path_to_model> -mode llama`
+ ```
+ python examples/chat.py -m <path_to_model> -mode llama
+ ```

The `-mode` argument chooses the prompt format to use. `llama` is for the Llama(2)-chat finetunes, while `codellama`
probably works better for CodeLlama-instruct. `raw` will produce a simple chatlog-style chat that works with base
models and various other finetunes. You can also provide a custom system prompt with `-p`.
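
For example, to chat with a CodeLlama-instruct model while supplying a custom system prompt (the prompt text here is
just an illustration):

```
python examples/chat.py -m <path_to_model> -mode codellama -p "You are a helpful programming assistant."
```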

- ## Performance
-
- Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and
- speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
-
- | Model      | Mode        | Size | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090    |
- |------------|-------------|------|-------|-----|------------|----------|------------|-------------|
- | Llama      | GPTQ        | 7B   | 128   | no  | 143 t/s    | 173 t/s  | 175 t/s    | **195** t/s |
- | Llama      | GPTQ        | 13B  | 128   | no  | 84 t/s     | 102 t/s  | 105 t/s    | **110** t/s |
- | Llama      | GPTQ        | 33B  | 128   | yes | 37 t/s     | 45 t/s   | 45 t/s     | **48** t/s  |
- | OpenLlama  | GPTQ        | 3B   | 128   | yes | 194 t/s    | 226 t/s  | 295 t/s    | **321** t/s |
- | CodeLlama  | EXL2 4.0bpw | 34B  | -     | -   | -          | -        | 42 t/s     | **48** t/s  |
- | Llama2     | EXL2 3.0bpw | 7B   | -     | -   | -          | -        | 195 t/s    | **224** t/s |
- | Llama2     | EXL2 4.0bpw | 7B   | -     | -   | -          | -        | 164 t/s    | **197** t/s |
- | Llama2     | EXL2 5.0bpw | 7B   | -     | -   | -          | -        | 144 t/s    | **160** t/s |
- | Llama2     | EXL2 2.5bpw | 70B  | -     | -   | -          | -        | 30 t/s     | **35** t/s  |
- | TinyLlama2 | EXL2 3.0bpw | 1.1B | -     | -   | -          | -        | 536 t/s    | **635** t/s |
- | TinyLlama2 | EXL2 4.0bpw | 1.1B | -     | -   | -          | -        | 509 t/s    | **590** t/s |
-

### Installation

Clone the repository and run `python setup.py install --user`. (PyPi package is coming, be patient.)
@@ -62,29 +65,31 @@ for subsequent use.

## EXL2 quantization

- ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new format. The new format is based on the same GPTQ/OBQ
- optimization approach, supporting 2, 3, 4, 5, 6 and 8-bit quantization. Most notably, by mixing them you can target any
- *average* bitrate from 2 up to 8 bits. This also allows for multiple quantization settings within each linear
- layer, producing something akin to sparse quantization wherein more important weights (columns) are quantized with more
- bits. The same remapping trick that lets ExLlamaV1 work efficiently with act-order models also allows this mixing
- of formats to happen with minimal impact on performance.
+ ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new "EXL2" format. EXL2 is based on the same
+ optimization method as GPTQ and supports 2, 3, 4, 5, 6 and 8-bit quantization. The format allows for mixing quantization
+ levels within a model to achieve any average bitrate between 2 and 8 bits per weight.

- The parameter selection is done automatically by quantizing each matrix multiple times, measuring the quantization
- error (with respect to the chosen calibration data) for each of a number of possible settings, and then finally creating
- a quantization scheme for the entire model that minimizes the maximum quantization error while meeting a target average
- bitrate.
+ Moreover, it's possible to apply multiple quantization levels to each linear layer, producing something akin to sparse
+ quantization wherein more important weights (columns) are quantized with more bits. The same remapping trick that lets
+ ExLlama work efficiently with act-order models allows this mixing of formats to happen with little to no impact on
+ performance.
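
To make the averaging concrete with made-up numbers: if 90% of a layer's columns are stored at 2 bits and the
remaining 10% at 8 bits, that layer comes out at 0.9 × 2 + 0.1 × 8 = 2.6 bits per weight on average, and per-layer
averages combine the same way (weighted by parameter count) into the model-wide bitrate.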

- In my tests, this allows Llama2 70B to run on a single 24 GB GPU at the full 4096-token context, producing at least
- coherent output with 2.5 bits per weight. It can still be unstable, so it probably still needs a little optimization.
- It also only *barely* fits in 24 GB, so it most likely won't work with a desktop environment running on the same GPU.
+ Parameter selection is done automatically by quantizing each matrix multiple times, measuring the quantization
+ error (with respect to the chosen calibration data) for each of a number of possible settings, per layer. Finally, a
+ combination is chosen that minimizes the maximum quantization error over the entire model while meeting a target
+ average bitrate.
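
As a deliberately simplified sketch of that selection step (not the actual logic of the conversion script, and assuming
each layer has already been measured at a few candidate settings, giving a (bits, error) pair per option):

```
# Toy illustration: pick one measured option per layer so that the worst per-layer error
# is as small as possible without exceeding a target average bits-per-weight budget.

def choose_settings(layers, target_avg_bpw):
    # layers: list of dicts with "params" (weight count) and "options",
    # a list of (bpw, error) tuples sorted by ascending bpw.
    choice = [0] * len(layers)                    # start every layer at its smallest option
    total_params = sum(layer["params"] for layer in layers)

    def avg_bpw(ch):
        return sum(layer["params"] * layer["options"][c][0]
                   for layer, c in zip(layers, ch)) / total_params

    while True:
        # The layer currently contributing the largest quantization error
        worst = max(range(len(layers)), key=lambda i: layers[i]["options"][choice[i]][1])
        if choice[worst] + 1 >= len(layers[worst]["options"]):
            break                                 # no bigger option left for the worst layer
        choice[worst] += 1                        # give it more bits...
        if avg_bpw(choice) > target_avg_bpw:
            choice[worst] -= 1                    # ...unless that blows the bitrate budget
            break
    return choice
```

The real measurement pass is of course far more involved, but the trade-off it navigates is the same: quantization
error versus average bitrate.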

- [![chat_screenshot](doc/screenshot_chat_2.5bit_thumb.png)](doc/screenshot_chat_2.5bit.png)
+ In my tests, this scheme allows Llama2 70B to run on a single 24 GB GPU with a full (4k) context, producing coherent
+ and mostly stable output with 2.55 bits per weight. 13B models run at 2.65 bits within 8 GB of VRAM, although currently
+ none of them uses GQA, which effectively limits the context size to 2048. In either case it's unlikely that the model
+ will fit alongside a desktop environment, though. For now.
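
As a back-of-envelope check on that 24 GB figure (rough numbers, not from the repo): 70B parameters at 2.55 bits per
weight come to about 70e9 × 2.55 / 8 ≈ 22 GB for the weights alone, before the cache and activations are added, which
is why it only just fits on a 24 GB card.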

- ### Conversion
+ [![chat_screenshot](doc/llama2_70b_chat_thumb.png)](doc/llama2_70b_chat.png)
+ [![chat_screenshot](doc/codellama_13b_instruct_thumb.png)](doc/codellama_13b_instruct.png)

- A script is provided to quantize models. Converting large models can be somewhat slow, so be warned.
+ ### Conversion

- To use it:
+ A script is provided to quantize models. Converting large models can be somewhat slow, so be warned. To use it:

```
python convert.py \
@@ -102,6 +107,9 @@ supplied (with the `-m` argument) to subsequent conversion jobs to skip the firs
the same model to different bitrates. Once complete, the quantized tensors will be compiled into `output.safetensors`,
and this file can replace the safetensors file in the original HF model.
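
As an illustration, a follow-up job targeting a different bitrate might reuse the measurement like this. Only `-m` is
described above; the other flags and the measurement filename are assumptions about the script's interface and may
differ:

```
python convert.py \
    -i <path_to_hf_model> \
    -o <working_dir> \
    -m <working_dir>/measurement.json \
    -b 4.0
```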

+ Roughly speaking, you'll need about 24 GB of VRAM to convert a 70B model, while 7B seems to require about 8 GB. There
+ are optimizations planned to accelerate conversion, utilizing more or larger GPUs.
+

### HuggingFace repos

I've uploaded a few EXL2-quantized models to HuggingFace, [here](https://huggingface.co/turboderp).
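
To grab one of them, any of the repos under that account can be cloned like a regular git repository (the repository
name below is a placeholder):

```
git lfs install
git clone https://huggingface.co/turboderp/<model_repo>
```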