README.md (+13 −13)
@@ -29,16 +29,16 @@ For inference, we have the option of
 ```python
 from torchao.quantization.quant_api import (
     quantize_,
-    int8_dynamic_activation_int8_weight,
-    int4_weight_only,
-    int8_weight_only
+    Int8DynamicActivationInt8WeightConfig,
+    Int4WeightOnlyConfig,
+    Int8WeightOnlyConfig
 )
-quantize_(m, int4_weight_only())
+quantize_(m, Int4WeightOnlyConfig())
 ```

-For gpt-fast `int4_weight_only()` is the best option at bs=1 as it **2x the tok/s and reduces the VRAM requirements by about 65%** over a torch.compiled baseline.
+For gpt-fast, `Int4WeightOnlyConfig()` is the best option at bs=1, as it **doubles the tok/s and reduces the VRAM requirements by about 65%** over a torch.compiled baseline.

-If you don't have enough VRAM to quantize your entire model on GPU and you find CPU quantization to be too slow then you can use the device argument like so `quantize_(model, int8_weight_only(), device="cuda")` which will send and quantize each layer individually to your GPU.
+If you don't have enough VRAM to quantize your entire model on the GPU and you find CPU quantization too slow, you can use the device argument like so: `quantize_(model, Int8WeightOnlyConfig(), device="cuda")`. This sends each layer to your GPU individually and quantizes it there.

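As a rough end-to-end sketch of that workflow (the small `nn.Sequential` model below is a stand-in for illustration, not from the diff):

```python
import torch
from torchao.quantization.quant_api import quantize_, Int8WeightOnlyConfig

# Stand-in model kept on CPU, e.g. because the unquantized weights don't fit in VRAM.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).to(torch.bfloat16)

# device="cuda" moves one layer at a time to the GPU and quantizes it there, so the
# full-precision model never needs to fit in GPU memory all at once.
quantize_(model, Int8WeightOnlyConfig(), device="cuda")
```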
 If you see slowdowns with any of these techniques or you're unsure which option to use, consider using [autoquant](./torchao/quantization/README.md#autoquantization) which will automatically profile layers and pick the best way to quantize each layer.

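For reference, a minimal autoquant sketch, assuming the `torchao.autoquant` entry point described in the linked quantization README (the toy model and shapes are illustrative):

```python
import torch
import torchao

# Toy stand-in for an inference model, in bf16 on the GPU.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# autoquant wraps the compiled model, profiles each layer on the inputs it sees,
# and picks the fastest quantization option (or none) per layer.
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

# Running a representative input triggers profiling and finalizes the per-layer choices.
model(torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16))
```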
@@ -63,27 +63,27 @@ Post-training quantization can result in a fast and compact model, but may also
@@ -139,7 +139,7 @@ The best example we have combining the composability of lower bit dtype with com

 We've added support for authoring and releasing [custom ops](./torchao/csrc/) that do not graph break with `torch.compile()`, so if you love writing kernels but hate packaging them so they work on all operating systems and CUDA versions, we'd love to accept contributions for your custom ops. We have a few examples you can follow:

-1.[fp6](torchao/dtypes/floatx) for 2x faster inference over fp16 with an easy to use API `quantize_(model, fpx_weight_only(3, 2))`
+1. [fp6](torchao/dtypes/floatx) for 2x faster inference over fp16 with an easy-to-use API: `quantize_(model, FPXWeightOnlyConfig(3, 2))`
 2. [2:4 Sparse Marlin GEMM](https://github.com/pytorch/ao/pull/733) 2x speedups for FP16xINT4 kernels even at batch sizes up to 256
 3. [int4 tinygemm unpacker](https://github.com/pytorch/ao/pull/415) which makes it easier to switch quantized backends for inference
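To illustrate the no-graph-break custom-op idea from the paragraph above: the torchao examples are C++/CUDA extensions, but the same registration contract can be sketched with PyTorch's Python custom-op API (assumes PyTorch 2.4+; `mylib::scaled_add` is a made-up toy op, not a torchao op):

```python
import torch
from torch.library import custom_op

# A toy custom op registered through the Python API; the torchao examples do the
# equivalent in C++/CUDA. Registering it as a proper op is what lets torch.compile
# trace through call sites without graph breaks.
@custom_op("mylib::scaled_add", mutates_args=())
def scaled_add(x: torch.Tensor, y: torch.Tensor, scale: float) -> torch.Tensor:
    # Real implementation; in practice this could dispatch to a custom CUDA kernel.
    return x + scale * y

@scaled_add.register_fake
def _(x, y, scale):
    # Shape/dtype-only "fake" implementation so the compiler can trace without
    # running the real kernel.
    return torch.empty_like(x)

@torch.compile(fullgraph=True)
def f(x, y):
    return scaled_add(x, y, 0.5)

print(f(torch.randn(4), torch.randn(4)))
```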
+or whichever other files you'd like to use for study. For example, you may consider the Segment Anything Video (SA-V) [Dataset](https://github.com/facebookresearch/sam2/tree/main/sav_dataset#download-the-dataset).
+
+The experimental results will then be saved under `output_folder` in result.csv
+
+# Reproducing experiments on Modal
+
+For this you can run `modal_experiments.sh` afterwards, but you'll want to run the experiments locally first to produce the meta annotations and the exported ahead-of-time compiled binaries.
+
+# Using the server locally
 ## Example curl command
 ```
 curl -X POST http://127.0.0.1:5000/upload -F 'image=@/path/to/file.jpg' --output path/to/output.png