1 parent 3d09e59 commit c53dc46
docs/source/features/quantization/auto_awq.md
@@ -2,12 +2,6 @@
 
 # AutoAWQ
 
-:::{warning}
-Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better
-accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low latency
-inference with small number of concurrent requests. vLLM's AWQ implementation have lower throughput than unquantized version.
-:::
-
 To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
 Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.
 The main benefits are lower latency and memory usage.
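For reference, quantizing a model with AutoAWQ typically follows the sketch below. The model and output paths are illustrative placeholders, and the `quant_config` values shown are AutoAWQ's commonly used 4-bit settings, not requirements from this commit.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Illustrative paths; substitute the model you want to quantize.
model_path = "mistralai/Mistral-7B-Instruct-v0.2"
quant_path = "mistral-instruct-v0.2-awq"

# Common AutoAWQ 4-bit configuration (group size 128, GEMM kernels).
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantize the weights to INT4.
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and tokenizer for later serving.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The resulting directory can then be loaded in vLLM with `quantization="awq"` (or `--quantization awq` when serving).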