Commit f4c55d3

docs : add performance troubleshoot + example benchmark documentation (ggml-org#1674)
* test anchor link
* test table
* add benchmarks
* Add performance troubleshoot & benchmark
* add benchmarks
* remove unneeded line

---------

Co-authored-by: Georgi Gerganov <[email protected]>
1 parent f146562 commit f4c55d3

File tree

3 files changed: +47 −6 lines changed

README.md (+7 −6)

````diff
@@ -267,11 +267,11 @@ Any value larger than 0 will offload the computation to the GPU. For example:
 
 Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). BLAS doesn't affect the normal generation performance. There are currently three different implementations of it:
 
-- **Accelerate Framework**:
+- #### Accelerate Framework:
 
   This is only available on Mac PCs and it's enabled by default. You can just build using the normal instructions.
 
-- **OpenBLAS**:
+- #### OpenBLAS:
 
   This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.
@@ -305,11 +305,11 @@ Building the program with BLAS support may lead to some performance improvements
   cmake --build . --config Release
   ```
 
-- **BLIS**
+- #### BLIS
 
   Check [BLIS.md](BLIS.md) for more information.
 
-- **Intel MKL**
+- #### Intel MKL
 
   By default, `LLAMA_BLAS_VENDOR` is set to `Generic`, so if you already sourced intel environment script and assign `-DLLAMA_BLAS=ON` in cmake, the mkl version of Blas will automatically been selected. You may also specify it by:
 
@@ -320,7 +320,7 @@ Building the program with BLAS support may lead to some performance improvements
   cmake --build . --config Release
   ```
 
-- **cuBLAS**
+- #### cuBLAS
 
   This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
   - Using `make`:
@@ -339,7 +339,7 @@ Building the program with BLAS support may lead to some performance improvements
 
   The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used.
 
-- **CLBlast**
+- #### CLBlast
 
   OpenCL acceleration is provided by the matrix multiplication kernels from the [CLBlast](https://github.com/CNugteren/CLBlast) project and custom kernels for ggml that can generate tokens on the GPU.
@@ -684,3 +684,4 @@ docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /mode
 ### Docs
 
 - [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)
+- [Performance troubleshooting](./docs/token_generation_performance_tips.md)
````

BLIS.md → docs/BLIS.md (file renamed without changes)
docs/token_generation_performance_tips.md (+40, new file)

# Token generation performance troubleshooting

## Verifying that the model is running on the GPU with cuBLAS
Make sure you compiled llama with the correct environment variables according to [this guide](../README.md#cublas), so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running llama, you may set `N` to a very large number; llama will offload the maximum possible number of layers to the GPU, even if that is fewer than the number you configured. For example:
```shell
./main -m "path/to/model.bin" -ngl 200000 -p "Please sir, may I have some "
```

Before llama starts the inference work, it outputs diagnostic information that shows whether cuBLAS is offloading work to the GPU. Look for these lines:
```shell
llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB
... rest of inference
```

If you see these lines, then the GPU is being used.

## Verifying that the CPU is not oversaturated
llama accepts a `-t N` (or `--threads N`) parameter. It's extremely important that this parameter is not too large. If your token generation is extremely slow, try setting this number to 1. If this significantly improves your token generation speed, then your CPU is being oversaturated and you need to explicitly set this parameter to the number of physical CPU cores on your machine (even if you utilize a GPU). If in doubt, start with 1 and double the amount until you hit a performance bottleneck, then scale the number down.
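The doubling search described above can be sketched in a few lines. This is an illustration only, not llama code: `measure_tokens_per_second` is a hypothetical callback that would run a short generation with `-t n` threads and report the observed throughput.

```python
# Sketch of the "start at 1 and double" thread-count search described above.
# measure_tokens_per_second is a hypothetical callback: it would run a short
# generation with `-t n` threads and return the observed tokens/second.
def find_thread_count(measure_tokens_per_second, max_threads=64):
    best_threads = 1
    best_tps = measure_tokens_per_second(1)
    n = 2
    while n <= max_threads:
        tps = measure_tokens_per_second(n)
        if tps <= best_tps:
            break  # throughput stopped improving: the CPU is saturated
        best_threads, best_tps = n, tps
        n *= 2
    return best_threads  # scale back down to the best count found
```

On most machines the search stops at or below the number of physical cores, which matches the advice above.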

## Example of runtime flags' effect on inference speed

These runs were tested on the following machine:
- GPU: A6000 (48GB VRAM)
- CPU: 7 physical cores
- RAM: 32GB

Model: `TheBloke_Wizard-Vicuna-30B-Uncensored-GGML/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin` (30B parameters, 4-bit quantization, GGML)

Run command: `./main -m "path/to/model.bin" -p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes: " -n 1000 [additional benchmark flags]`

Result:

| command | tokens/second (higher is better) |
| - | - |
| `-ngl 2000000` | N/A (less than 0.1) |
| `-t 7` | 1.7 |
| `-t 1 -ngl 2000000` | 5.5 |
| `-t 7 -ngl 2000000` | 8.7 |
| `-t 4 -ngl 2000000` | 9.1 |
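As a quick sanity check on numbers like these, the relative speedups versus the CPU-only run can be computed with a short script. The values below are copied from the table above; the dictionary layout is just for illustration.

```python
# Relative speedup of each run versus the CPU-only baseline (-t 7),
# using the tokens/second figures from the benchmark table above.
results = {
    "-t 7": 1.7,
    "-t 1 -ngl 2000000": 5.5,
    "-t 7 -ngl 2000000": 8.7,
    "-t 4 -ngl 2000000": 9.1,
}
baseline = results["-t 7"]
speedups = {flags: tps / baseline for flags, tps in results.items()}
best = max(speedups, key=speedups.get)
# full GPU offload with -t 4 comes out roughly 5x faster than CPU-only
```

Note that on this 7-core machine the best result uses `-t 4`, not `-t 7`: once layers are offloaded, more CPU threads can hurt rather than help.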
