Commit c53dc46

[Doc] Remove performance warning for auto_awq.md (#12743)
1 parent: 3d09e59

File tree

1 file changed (+0, −6 lines)


docs/source/features/quantization/auto_awq.md

@@ -2,12 +2,6 @@
 
 # AutoAWQ
 
-:::{warning}
-Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better
-accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low latency
-inference with small number of concurrent requests. vLLM's AWQ implementation have lower throughput than unquantized version.
-:::
-
 To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
 Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.
 The main benefits are lower latency and memory usage.
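For context, the workflow the retained doc text describes is unchanged by this commit: AutoAWQ produces the 4-bit checkpoint, which vLLM then loads. Below is a minimal sketch of that quantization flow, assuming the `autoawq` package is installed; the model and output paths are illustrative, not prescribed by this commit.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Illustrative paths; any causal LM supported by AutoAWQ works the same way.
model_path = "mistralai/Mistral-7B-Instruct-v0.2"
quant_path = "mistral-7b-instruct-awq"

# Typical 4-bit AWQ settings: INT4 weights, group size 128, zero-point enabled.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and tokenizer, quantize (calibration data is handled by
# AutoAWQ's defaults), and save the INT4 checkpoint.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The saved directory can then be served with vLLM, e.g. `from vllm import LLM; llm = LLM(model=quant_path, quantization="awq")`.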
