Commit e9ed4d3: Update Readme

stack-info: PR: #1526, branch: drisspg/stack/24
1 parent d42a382


README.md

Lines changed: 36 additions & 39 deletions
@@ -19,19 +19,19 @@ torchao just works with `torch.compile()` and `FSDP2` over most PyTorch models o

### Post Training Quantization

Quantizing and sparsifying your models is a one-liner that should work on any model with an `nn.Linear`, including your favorite HuggingFace model. You can find more comprehensive usage instructions [here](torchao/quantization/README.md), sparsity [here](torchao/sparsity/README.md) and a HuggingFace inference example [here](scripts/hf_eval.py)

For inference, we have the option of
1. Quantize only the weights: works best for memory bound models
2. Quantize the weights and activations: works best for compute bound models
3. Quantize the activations and weights and sparsify the weights

```python
from torchao.quantization.quant_api import (
    quantize_,
    int8_dynamic_activation_int8_weight,
    float8_dynamic_activation_float8_weight,
    int4_weight_only,
)
quantize_(m, int4_weight_only())
```
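
For context, a minimal end-to-end sketch of that one-liner on a toy model (the model, shapes, and the bf16/CUDA setup are illustrative assumptions, not from this diff):

```python
import torch
from torch import nn
from torchao.quantization.quant_api import quantize_, int4_weight_only

# Toy model; any model containing nn.Linear layers works the same way
m = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
m = m.to(device="cuda", dtype=torch.bfloat16)  # int4 weight-only targets bf16 weights on CUDA

# Swap the Linear weights for int4 weight-only quantized tensors, in place
quantize_(m, int4_weight_only())

# Composes with torch.compile for fused low-bit kernels
m = torch.compile(m, mode="max-autotune")
out = m(torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16))
```
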
@@ -52,11 +52,12 @@ We also provide a developer facing API so you can implement your own quantizatio

We've added kv cache quantization and other features in order to enable long context length (and necessarily memory-efficient) inference.

In practice, these features alongside int4 weight-only quantization allow us to **reduce peak memory by ~55%**, meaning we can run Llama3.1-8B inference with a **130k context length with only 18.9 GB of peak memory.** More details can be found [here](torchao/_models/llama/README.md#kv-cache-quantization---memory-efficient-inference)
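
The actual implementation lives in the llama example linked above; as a rough, purely illustrative sketch of the idea (not the torchao code), int8 per-token quantization of cached keys and values looks something like this:

```python
import torch

def quantize_per_token(x: torch.Tensor, eps: float = 1e-6):
    # Symmetric int8 quantization with one scale per token (scales over the head dim)
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor, dtype=torch.bfloat16):
    return (q.to(torch.float32) * scale).to(dtype)

# Storing new KV entries as int8 roughly halves the cache footprint vs bf16
k = torch.randn(1, 8, 1, 128, dtype=torch.bfloat16)  # (batch, heads, new tokens, head_dim)
k_q, k_scale = quantize_per_token(k)
k_restored = dequantize(k_q, k_scale)  # dequantized on read before attention
```
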
### Quantization Aware Training

Post-training quantization can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization Aware Training (QAT) to overcome this limitation. In collaboration with [Torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md#quantization-aware-training-qat), we've developed a QAT recipe that demonstrates significant accuracy improvements over traditional PTQ, recovering **96% of the accuracy degradation on hellaswag and 68% of the perplexity degradation on wikitext** for Llama3 compared to post-training quantization (PTQ). We've provided a full recipe [here](https://pytorch.org/blog/quantization-aware-training/)

```python
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer
```
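
The hunk cuts the snippet off after the import; a minimal sketch of the usual prepare/train/convert flow with this quantizer (the toy model and exact call sites are assumptions, not taken from this diff):

```python
import torch
from torch import nn
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

# Any model with nn.Linear layers; a toy stand-in here
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# Insert "fake quantize" ops into the linear layers so training sees quantization numerics
model = qat_quantizer.prepare(model)

# ... fine-tune `model` as usual ...

# Swap the fake-quantize ops for real int8-dynamic-activation / int4-weight quantized ops
model = qat_quantizer.convert(model)
```
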
@@ -96,6 +97,8 @@ We've added support for semi-structured 2:4 sparsity with **6% end-to-end speedu
The code change is a one-liner with the full example available [here](torchao/sparsity/training/)

```python
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear

swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})
```
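
A slightly fuller sketch of what that swap looks like on a toy model (the module names, shapes, and fp16/CUDA setup are illustrative assumptions):

```python
import torch
from torch import nn
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear

# Toy model whose first linear layer we accelerate with 2:4 sparsity during training
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
model = model.to(device="cuda", dtype=torch.half)

# Map fully qualified module names to the sparse subclass that should replace them
swap_linear_with_semi_sparse_linear(model, {"0": SemiSparseLinear})

# Training proceeds as usual; weights are pruned/compressed to 2:4 on the fly
out = model(torch.randn(64, 1024, device="cuda", dtype=torch.half))
out.sum().backward()
```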

@@ -117,58 +120,52 @@ optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, fused=True)

```python
optim.load_state_dict(ckpt["optim"])
```

## Installation

`torchao` makes liberal use of several new features in PyTorch, so it's recommended to use it with the current nightly or the latest stable version of PyTorch; see [getting started](https://pytorch.org/get-started/locally/) for more details.

#### Stable release from PyPI, which will default to CUDA 12.4

```Shell
pip install torchao
```

#### Nightly Release

```Shell
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu124 # full options are cpu/cu118/cu121/cu124
```

#### From Source

For *most* developers, you'll probably want to skip building custom C++/CUDA extensions for faster iteration:

```Shell
USE_CPP=0 pip install -e .
```
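
A quick sanity check after any of the above, confirming both packages import and CUDA is visible (purely illustrative):

```Shell
python -c "import torch, torchao; print(torch.__version__, torch.cuda.is_available())"
```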

## Composability

`torch.compile`: A key design principle for us is composability - any custom dtype or memory layout should work with our compiler. We enable kernel implementations in PyTorch, CUDA, C++, or Triton. This allows researchers and engineers to start with high-level dtype and layout logic in pure PyTorch, then progressively optimize performance by implementing lower-level kernels as needed, while maintaining compatibility with the compile infrastructure.

[FSDP2](https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md): Historically most quantization has been done for inference, but there is now a thriving area of research combining distributed algorithms and quantization.

The best example we have of combining the composability of a lower-bit dtype with compile and FSDP is [NF4](torchao/dtypes/nf4tensor.py), which we used to implement the [QLoRA](https://www.youtube.com/watch?v=UvRl4ansfCg) algorithm. So if you're doing research at this intersection we'd love to hear from you.
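
As a small taste of that composability, converting a weight to NF4 is a tensor-subclass conversion (a minimal sketch; the `to_nf4` helper and the block sizes shown are assumptions about the defaults, not taken from this diff):

```python
import torch
from torchao.dtypes import to_nf4

# A frozen base-model weight, as in QLoRA
weight = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Convert to the NF4 (4-bit NormalFloat) tensor subclass
nf4_weight = to_nf4(weight, block_size=64, scaler_block_size=256)

# It still behaves like a tensor, which is what lets it compose with
# torch.compile and FSDP2 rather than requiring bespoke kernels everywhere
print(type(nf4_weight), nf4_weight.shape, nf4_weight.dtype)
```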

## Custom Kernels

We've added support for authoring and releasing [custom ops](./torchao/csrc/) that do not graph break with `torch.compile()`. We have a few examples you can follow:

1. [fp6](torchao/dtypes/floatx/README.md) for 2x faster inference over fp16, with an easy to use API: `quantize_(model, fpx_weight_only(3, 2))` (see the sketch below)
2. [2:4 Sparse Marlin GEMM](https://github.com/pytorch/ao/pull/733): 2x speedups for FP16xINT4 kernels, even at batch sizes up to 256
3. [int4 tinygemm unpacker](https://github.com/pytorch/ao/pull/415), which makes it easier to switch quantized backends for inference
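
The fp6 path in item 1 can be exercised like so (toy model; the fp16/CUDA setup and the import path, assumed to follow the other `quant_api` helpers, are illustrative assumptions):

```python
import torch
from torch import nn
from torchao.quantization.quant_api import quantize_, fpx_weight_only

model = nn.Sequential(nn.Linear(4096, 4096)).to(device="cuda", dtype=torch.float16)

# fpx_weight_only(3, 2) selects fp6: 1 sign bit, 3 exponent bits, 2 mantissa bits
quantize_(model, fpx_weight_only(3, 2))
out = model(torch.randn(8, 4096, device="cuda", dtype=torch.float16))
```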

If you believe there are other CUDA kernels we should take a closer look at, please leave a comment on [this issue](https://github.com/pytorch/ao/issues/697) or feel free to contribute directly to the repo.

## Prototype Features

Check out our [prototype directory](torchao/prototype/README.md) where we experiment with cutting-edge model optimization techniques for both training and inference. If you're interested in contributing experimental research or just want to explore, feel free to open an issue or PR.

## OSS Integrations
