@@ -11,7 +11,27 @@ still needs a lot of testing and tuning, and a few key features are not yet impl
- Support for a new quant format (see below)

- ## How to do stuff
+ ## Performance
+
+ Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and
+ speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
+
+ | Model     | Mode         | Size | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090    |
+ |-----------|--------------|------|-------|-----|------------|----------|------------|-------------|
+ | Llama     | GPTQ         | 7B   | 128   | no  | 143 t/s    | 173 t/s  | 175 t/s    | **195** t/s |
+ | Llama     | GPTQ         | 13B  | 128   | no  | 84 t/s     | 102 t/s  | 105 t/s    | **110** t/s |
+ | Llama     | GPTQ         | 33B  | 128   | yes | 37 t/s     | 45 t/s   | 45 t/s     | **48** t/s  |
+ | OpenLlama | GPTQ         | 3B   | 128   | yes | 194 t/s    | 226 t/s  | 295 t/s    | **321** t/s |
+ | CodeLlama | EXL2 4.0 bpw | 34B  | -     | -   | -          | -        | 42 t/s     | **48** t/s  |
+ | Llama2    | EXL2 3.0 bpw | 7B   | -     | -   | -          | -        | 195 t/s    | **224** t/s |
+ | Llama2    | EXL2 4.0 bpw | 7B   | -     | -   | -          | -        | 164 t/s    | **197** t/s |
+ | Llama2    | EXL2 5.0 bpw | 7B   | -     | -   | -          | -        | 144 t/s    | **160** t/s |
+ | Llama2    | EXL2 2.5 bpw | 70B  | -     | -   | -          | -        | 30 t/s     | **35** t/s  |
+ | TinyLlama | EXL2 3.0 bpw | 1.1B | -     | -   | -          | -        | 536 t/s    | **635** t/s |
+ | TinyLlama | EXL2 4.0 bpw | 1.1B | -     | -   | -          | -        | 509 t/s    | **590** t/s |
+
+
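As a rough illustration of how the tokens-per-second numbers above can be measured, here is a minimal timing sketch
using the Python API the way the repo's example scripts use it. The class and method names below are an assumption
about that API and may not match every version exactly.

```
# Minimal speed test: load a model, generate a fixed number of tokens, report t/s.
import time

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "<path_to_model>"        # directory with config.json and the .safetensors file
config.prepare()

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()      # default sampling settings
num_tokens = 128

start = time.time()
generator.generate_simple("Once upon a time,", settings, num_tokens)
print(f"{num_tokens / (time.time() - start):.1f} t/s")
```
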
+ ## How to
Clone the repository and install dependencies:
@@ -25,32 +45,15 @@ python test_inference -m <path_to_model> -p "Once upon a time,"
For now, a simple console chatbot is included. Run it with:

- `python examples/chat.py -m <path_to_model> -mode llama`
+ ```
+ python examples/chat.py -m <path_to_model> -mode llama
+ ```

The `-mode` argument chooses the prompt format to use. `llama` is for the Llama(2)-chat finetunes, while `codellama`
probably works better for CodeLlama-instruct. `raw` will produce a simple chatlog-style chat that works with base
models and various other finetunes. You can also provide a custom system prompt with `-p`.
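
For example, to chat with a CodeLlama-instruct model while supplying a custom system prompt (the prompt text here is
just an illustration):

```
python examples/chat.py -m <path_to_model> -mode codellama -p "You are a helpful programming assistant."
```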

- ## Performance
-
- Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and
- speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
-
- | Model      | Mode        | Size | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090    |
- |------------|-------------|------|-------|-----|------------|----------|------------|-------------|
- | Llama      | GPTQ        | 7B   | 128   | no  | 143 t/s    | 173 t/s  | 175 t/s    | **195** t/s |
- | Llama      | GPTQ        | 13B  | 128   | no  | 84 t/s     | 102 t/s  | 105 t/s    | **110** t/s |
- | Llama      | GPTQ        | 33B  | 128   | yes | 37 t/s     | 45 t/s   | 45 t/s     | **48** t/s  |
- | OpenLlama  | GPTQ        | 3B   | 128   | yes | 194 t/s    | 226 t/s  | 295 t/s    | **321** t/s |
- | CodeLlama  | EXL2 4.0bpw | 34B  | -     | -   | -          | -        | 42 t/s     | **48** t/s  |
- | Llama2     | EXL2 3.0bpw | 7B   | -     | -   | -          | -        | 195 t/s    | **224** t/s |
- | Llama2     | EXL2 4.0bpw | 7B   | -     | -   | -          | -        | 164 t/s    | **197** t/s |
- | Llama2     | EXL2 5.0bpw | 7B   | -     | -   | -          | -        | 144 t/s    | **160** t/s |
- | Llama2     | EXL2 2.5bpw | 70B  | -     | -   | -          | -        | 30 t/s     | **35** t/s  |
- | TinyLlama2 | EXL2 3.0bpw | 1.1B | -     | -   | -          | -        | 536 t/s    | **635** t/s |
- | TinyLlama2 | EXL2 4.0bpw | 1.1B | -     | -   | -          | -        | 509 t/s    | **590** t/s |
-

### Installation

Clone the repository and run `python setup.py install --user`. (PyPi package is coming, be patient.)
@@ -62,29 +65,31 @@ for subsequent use.

## EXL2 quantization

- ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new format. The new format is based on the same GPTQ/OBQ
- optimization approach, supporting 2, 3, 4, 5, 6 and 8-bit quantization. Most notably, by mixing them you can target any
- *average* bitrate from 2 up to 8 bits. This also allows for multiple quantization settings within each linear
- layer, producing something akin to sparse quantization wherein more important weights (columns) are quantized with more
- bits. The same remapping trick that lets ExLlamaV1 work efficiently with act-order models also allows this mixing
- of formats to happen with minimal impact on performance.
+ ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new "EXL2" format. EXL2 is based on the same
+ optimization method as GPTQ and supports 2, 3, 4, 5, 6 and 8-bit quantization. The format allows for mixing quantization
+ levels within a model to achieve any average bitrate between 2 and 8 bits per weight.

- The parameter selection is done automatically by quantizing each matrix multiple times, measuring the quantization
- error (with respect to the chosen calibration data) for each of a number of possible settings, and then finally creating
- a quantization scheme for the entire model that minimizes the maximum quantization error while meeting a target average
- bitrate.
+ Moreover, it's possible to apply multiple quantization levels to each linear layer, producing something akin to sparse
+ quantization wherein more important weights (columns) are quantized with more bits. The same remapping trick that lets
+ ExLlama work efficiently with act-order models allows this mixing of formats to happen with little to no impact on
+ performance.
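
To make the averaging concrete with made-up numbers: if 90% of a layer's columns are stored at 2 bits and the
remaining 10% at 8 bits, that layer comes out at 0.9 × 2 + 0.1 × 8 = 2.6 bits per weight on average, and per-layer
averages combine the same way (weighted by parameter count) into the model-wide bitrate.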

- In my tests, this allows Llama2 70B to run on a single 24 GB GPU at the full 4096-token context, producing at least
- coherent output with 2.5 bits per weight. It can still be unstable, so it probably still needs a little optimization.
- It also only *barely* fits in 24 GB, so it most likely won't work with a desktop environment running on the same GPU.
+ Parameter selection is done automatically by quantizing each matrix multiple times, measuring the quantization
+ error (with respect to the chosen calibration data) for each of a number of possible settings, per layer. Finally, a
+ combination is chosen that minimizes the maximum quantization error over the entire model while meeting a target
+ average bitrate.
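
As a deliberately simplified sketch of that selection step (not the actual logic of the conversion script, and assuming
each layer has already been measured at a few candidate settings, giving a (bits, error) pair per option):

```
# Toy illustration: pick one measured option per layer so that the worst per-layer error
# is as small as possible without exceeding a target average bits-per-weight budget.

def choose_settings(layers, target_avg_bpw):
    # layers: list of dicts with "params" (weight count) and "options",
    # a list of (bpw, error) tuples sorted by ascending bpw.
    choice = [0] * len(layers)                    # start every layer at its smallest option
    total_params = sum(layer["params"] for layer in layers)

    def avg_bpw(ch):
        return sum(layer["params"] * layer["options"][c][0]
                   for layer, c in zip(layers, ch)) / total_params

    while True:
        # The layer currently contributing the largest quantization error
        worst = max(range(len(layers)), key=lambda i: layers[i]["options"][choice[i]][1])
        if choice[worst] + 1 >= len(layers[worst]["options"]):
            break                                 # no bigger option left for the worst layer
        choice[worst] += 1                        # give it more bits...
        if avg_bpw(choice) > target_avg_bpw:
            choice[worst] -= 1                    # ...unless that blows the bitrate budget
            break
    return choice
```

The real measurement pass is of course far more involved, but the trade-off it navigates is the same: quantization
error versus average bitrate.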

- [![chat_screenshot](doc/screenshot_chat_2.5bit_thumb.png)](doc/screenshot_chat_2.5bit.png)
+ In my tests, this scheme allows Llama2 70B to run on a single 24 GB GPU with a full (4k) context, producing coherent
+ and mostly stable output with 2.55 bits per weight. 13B models run at 2.65 bits within 8 GB of VRAM, although currently
+ none of them uses GQA, which effectively limits the context size to 2048. In either case it's unlikely that the model
+ will fit alongside a desktop environment, though. For now.
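
As a back-of-envelope check on that 24 GB figure (rough numbers, not from the repo): 70B parameters at 2.55 bits per
weight come to about 70e9 × 2.55 / 8 ≈ 22 GB for the weights alone, before the cache and activations are added, which
is why it only just fits on a 24 GB card.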

- ### Conversion
+ [![chat_screenshot](doc/llama2_70b_chat_thumb.png)](doc/llama2_70b_chat.png)
+ [![chat_screenshot](doc/codellama_13b_instruct_thumb.png)](doc/codellama_13b_instruct.png)

- A script is provided to quantize models. Converting large models can be somewhat slow, so be warned.
+ ### Conversion

- To use it:
+ A script is provided to quantize models. Converting large models can be somewhat slow, so be warned. To use it:

```
python convert.py \
@@ -102,6 +107,9 @@ supplied (with the `-m` argument) to subsequent conversion jobs to skip the firs
the same model to different bitrates. Once complete, the quantized tensors will be compiled into `output.safetensors`,
and this file can replace the safetensors file in the original HF model.
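
As an illustration, a follow-up job targeting a different bitrate might reuse the measurement like this. Only `-m` is
described above; the other flags and the measurement filename are assumptions about the script's interface and may
differ:

```
python convert.py \
    -i <path_to_hf_model> \
    -o <working_dir> \
    -m <working_dir>/measurement.json \
    -b 4.0
```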

+ Roughly speaking, you'll need about 24 GB of VRAM to convert a 70B model, while 7B seems to require about 8 GB. There
+ are optimizations planned to accelerate conversion, utilizing more or larger GPUs.
+

### HuggingFace repos

I've uploaded a few EXL2-quantized models to HuggingFace, [here](https://huggingface.co/turboderp).
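
To grab one of them, any of the repos under that account can be cloned like a regular git repository (the repository
name below is a placeholder):

```
git lfs install
git clone https://huggingface.co/turboderp/<model_repo>
```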