
Commit c240eb0

Update README.md
1 parent 7704a68 commit c240eb0

7 files changed: +46, -38 lines changed

README.md (+46, -38)

@@ -11,7 +11,27 @@ still needs a lot of testing and tuning, and a few key features are not yet impl
 - Support for a new quant format (see below)
 
 
-## How to do stuff
+## Performance
+
+Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and
+speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
+
+| Model     | Mode         | Size | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090    |
+|-----------|--------------|------|-------|-----|------------|----------|------------|-------------|
+| Llama     | GPTQ         | 7B   | 128   | no  | 143 t/s    | 173 t/s  | 175 t/s    | **195** t/s |
+| Llama     | GPTQ         | 13B  | 128   | no  | 84 t/s     | 102 t/s  | 105 t/s    | **110** t/s |
+| Llama     | GPTQ         | 33B  | 128   | yes | 37 t/s     | 45 t/s   | 45 t/s     | **48** t/s  |
+| OpenLlama | GPTQ         | 3B   | 128   | yes | 194 t/s    | 226 t/s  | 295 t/s    | **321** t/s |
+| CodeLlama | EXL2 4.0 bpw | 34B  | -     | -   | -          | -        | 42 t/s     | **48** t/s  |
+| Llama2    | EXL2 3.0 bpw | 7B   | -     | -   | -          | -        | 195 t/s    | **224** t/s |
+| Llama2    | EXL2 4.0 bpw | 7B   | -     | -   | -          | -        | 164 t/s    | **197** t/s |
+| Llama2    | EXL2 5.0 bpw | 7B   | -     | -   | -          | -        | 144 t/s    | **160** t/s |
+| Llama2    | EXL2 2.5 bpw | 70B  | -     | -   | -          | -        | 30 t/s     | **35** t/s  |
+| TinyLlama | EXL2 3.0 bpw | 1.1B | -     | -   | -          | -        | 536 t/s    | **635** t/s |
+| TinyLlama | EXL2 4.0 bpw | 1.1B | -     | -   | -          | -        | 509 t/s    | **590** t/s |
+
+
+## How to
 
 Clone the repository and install dependencies:
 
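The actual setup commands fall outside the hunks shown here, so the following is only a rough, hypothetical sketch of what "clone the repository and install dependencies" typically amounts to; the repository URL and requirements file name are assumptions, not part of this diff:

```
# Hypothetical setup steps; the exact commands are elided from this diff.
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt   # assumes a standard requirements.txt is provided
```
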
@@ -25,32 +45,15 @@ python test_inference -m <path_to_model> -p "Once upon a time,"
 
 For now, a simple console chatbot is included. Run it with:
 
-`python examples/chat.py -m <path_to_model> -mode llama`
+```
+python examples/chat.py -m <path_to_model> -mode llama
+```
 
 The `-mode` argument chooses the prompt format to use. `llama` is for the Llama(2)-chat finetunes, while `codellama`
 probably works better for CodeLlama-instruct. `raw` will produce a simple chatlog-style chat that works with base
 models and various other finetunes. You can also provide a custom system prompt with `-p`.
 
 
-## Performance
-
-Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and
-speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
-
-| Model      | Mode        | Size | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090    |
-|------------|-------------|------|-------|-----|------------|----------|------------|-------------|
-| Llama      | GPTQ        | 7B   | 128   | no  | 143 t/s    | 173 t/s  | 175 t/s    | **195** t/s |
-| Llama      | GPTQ        | 13B  | 128   | no  | 84 t/s     | 102 t/s  | 105 t/s    | **110** t/s |
-| Llama      | GPTQ        | 33B  | 128   | yes | 37 t/s     | 45 t/s   | 45 t/s     | **48** t/s  |
-| OpenLlama  | GPTQ        | 3B   | 128   | yes | 194 t/s    | 226 t/s  | 295 t/s    | **321** t/s |
-| CodeLlama  | EXL2 4.0bpw | 34B  | -     | -   | -          | -        | 42 t/s     | **48** t/s  |
-| Llama2     | EXL2 3.0bpw | 7B   | -     | -   | -          | -        | 195 t/s    | **224** t/s |
-| Llama2     | EXL2 4.0bpw | 7B   | -     | -   | -          | -        | 164 t/s    | **197** t/s |
-| Llama2     | EXL2 5.0bpw | 7B   | -     | -   | -          | -        | 144 t/s    | **160** t/s |
-| Llama2     | EXL2 2.5bpw | 70B  | -     | -   | -          | -        | 30 t/s     | **35** t/s  |
-| TinyLlama2 | EXL2 3.0bpw | 1.1B | -     | -   | -          | -        | 536 t/s    | **635** t/s |
-| TinyLlama2 | EXL2 4.0bpw | 1.1B | -     | -   | -          | -        | 509 t/s    | **590** t/s |
-
 ### Installation
 
 Clone the repository and run `python setup.py install --user`. (PyPi package is coming, be patient.)
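To make the chatbot invocation above concrete, here are two example runs that use only the options documented in that hunk (`-m`, `-mode`, `-p`); the model paths are placeholders:

```
# Llama2-chat finetune, using the llama prompt format (model path is a placeholder)
python examples/chat.py -m <path_to_llama2_chat_model> -mode llama

# CodeLlama-instruct with a custom system prompt
python examples/chat.py -m <path_to_codellama_instruct_model> \
    -mode codellama -p "You are a helpful coding assistant."
```
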
@@ -62,29 +65,31 @@ for subsequent use.
 
 ## EXL2 quantization
 
-ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new format. The new format is based on the same GPTQ/OBQ
-optimization approach, supporting 2, 3, 4, 5, 6 and 8-bit quantization. Most notably, by mixing them you can target any
-*average* bitrate from 2 up to 8 bits. This also allows for multiple quantization settings within each linear
-layer, producing something akin to sparse quantization wherein more important weights (columns) are quantized with more
-bits. The same remapping trick that lets ExLlamaV1 work efficiently with act-order models also allows this mixing
-of formats to happen with minimal impact on performance.
+ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new "EXL2" format. EXL2 is based on the same
+optimization method as GPTQ and supports 2, 3, 4, 5, 6 and 8-bit quantization. The format allows for mixing quantization
+levels within a model to achieve any average bitrate between 2 and 8 bits per weight.
 
-The parameter selection is done automatically by quantizing each matrix multiple times, measuring the quantization
-error (with respect to the chosen calibration data) for each of a number of possible settings, and then finally creating
-a quantization scheme for the entire model that minimizes the maximum quantization error while meeting a target average
-bitrate.
+Moreover, it's possible to apply multiple quantization levels to each linear layer, producing something akin to sparse
+quantization wherein more important weights (columns) are quantized with more bits. The same remapping trick that lets
+ExLlama work efficiently with act-order models allows this mixing of formats to happen with little to no impact on
+performance.
 
-In my tests, this allows Llama2 70B to run on a single 24 GB GPU at the full 4096-token context, producing at least
-coherent output with 2.5 bits per weight. It can still be unstable, so it probably still needs a little optimization.
-It also only *barely* fits in 24 GB, so it most likely won't work with a desktop environment running on the same GPU.
+Parameter selection is done automatically by quantizing each matrix multiple times, measuring the quantization
+error (with respect to the chosen calibration data) for each of a number of possible settings, per layer. Finally, a
+combination is chosen that minimizes the maximum quantization error over the entire model while meeting a target
+average bitrate.
 
-[![chat_screenshot](doc/screenshot_chat_2.5bit_thumb.png)](doc/screenshot_chat_2.5bit.png)
+In my tests, this scheme allows Llama2 70B to run on a single 24 GB GPU with a full (4k) context, producing coherent
+and mostly stable output with 2.55 bits per weight. 13B models run at 2.65 bits within 8 GB of VRAM, although currently
+none of them uses GQA, which effectively limits the context size to 2048. In either case it's unlikely that the model
+will fit alongside a desktop environment, though. For now.
 
-### Conversion
+[![chat_screenshot](doc/llama2_70b_chat_thumb.png)](doc/llama2_70b_chat.png)
+[![chat_screenshot](doc/codellama_13b_instruct_thumb.png)](doc/codellama_13b_instruct.png)
 
-A script is provided to quantize models. Converting large models can be somewhat slow, so be warned.
+### Conversion
 
-To use it:
+A script is provided to quantize models. Converting large models can be somewhat slow, so be warned. To use it:
 
 ```
 python convert.py \
@@ -102,6 +107,9 @@ supplied (with the `-m` argument) to subsequent conversion jobs to skip the firs
 the same model to different bitrates. Once complete, the quantized tensors will be compiled into `output.safetensors`,
 and this file can replace the safetensors file in the original HF model.
 
+Roughly speaking, you'll need about 24 GB of VRAM to convert a 70B model, while 7B seems to require about 8 GB. There
+are optimizations planned to accelerate conversion, utilizing more or larger GPUs.
+
 ### HuggingFace repos
 
 I've uploaded a few EXL2-quantized models to HuggingFace, [here](https://huggingface.co/turboderp).
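For context on the memory figures above: at 2.55 bits per weight, the weights of a 70B model alone come to roughly 70e9 * 2.55 / 8 bits, or about 22 GB, which is why 24 GB is a tight fit. To illustrate the selection strategy described under "EXL2 quantization" (minimize the worst quantization error while meeting a target average bitrate), here is a minimal, self-contained sketch. It is not the actual convert.py logic, and the layer sizes, candidate settings and error values are invented for illustration:

```
# Illustrative sketch only: not the convert.py implementation. Candidate
# settings and error values are invented; real measurements would come from
# quantizing each matrix against calibration data.

def choose_settings(layers, target_avg_bpw):
    """Pick one (bpw, error) candidate per layer so that the maximum error is
    minimized while the weight-averaged bitrate stays at or below the target.

    layers: list of (num_weights, candidates), where candidates is a list of
            (bits_per_weight, measured_error) options for that layer.
    """
    total_weights = sum(n for n, _ in layers)
    budget_bits = target_avg_bpw * total_weights

    # Binary-search over the set of measured error values: for a given error
    # threshold, each layer takes its cheapest candidate within the threshold.
    thresholds = sorted({err for _, cands in layers for _, err in cands})

    def plan_for(threshold):
        plan, bits = [], 0.0
        for n, cands in layers:
            ok = [c for c in cands if c[1] <= threshold]
            if not ok:
                return None, float("inf")
            best = min(ok, key=lambda c: c[0])  # cheapest admissible option
            plan.append(best)
            bits += best[0] * n
        return plan, bits

    lo, hi, best_plan = 0, len(thresholds) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        plan, bits = plan_for(thresholds[mid])
        if plan is not None and bits <= budget_bits:
            best_plan, hi = plan, mid - 1   # feasible: try a stricter error bound
        else:
            lo = mid + 1                    # infeasible: relax the error bound
    return best_plan


# Toy example: three layers with hypothetical (bits_per_weight, error) candidates.
layers = [
    (4096 * 4096,  [(2.0, 0.30), (3.0, 0.12), (4.0, 0.05)]),
    (4096 * 11008, [(2.0, 0.20), (3.0, 0.08), (4.0, 0.03)]),
    (4096 * 4096,  [(2.0, 0.40), (3.0, 0.15), (4.0, 0.06)]),
]
print(choose_settings(layers, target_avg_bpw=2.55))
```

Binary-searching the error threshold works because relaxing the threshold can only shrink each layer's cheapest admissible bit count, so feasibility against the bit budget is monotone in the threshold.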

doc/codellama_13b_instruct.png (added, 184 KB)

doc/codellama_13b_instruct_thumb.png (added, 76.7 KB)

doc/llama2_70b_chat.png (added, 173 KB)

doc/llama2_70b_chat_thumb.png (added, 94.1 KB)

doc/screenshot_chat_2.5bit.png (removed, 207 KB)

doc/screenshot_chat_2.5bit_thumb.png (removed, 98.3 KB)
