promote blocksparse from prototype, make it faster (#1734)
This PR promotes block sparsity out of prototype in torchao.
Chiefly, it ports the triton addmm blocksparse kernels over from core and makes several performance improvements to them.
All of the numbers reported below are for an H100, with blocksize=64 and sparsity_level=0.9. The dense baseline is 134 tok/s.
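For context, here is a minimal sketch of the kind of block-sparse linear this benchmarks. The shapes, dtype, and random mask are illustrative assumptions, not the benchmark's actual configuration:

```
import torch

torch.manual_seed(0)
out_features, in_features, blocksize = 4096, 4096, 64

# Keep ~10% of the 64x64 blocks -> sparsity_level=0.9.
block_mask = torch.rand(out_features // blocksize, in_features // blocksize) > 0.9
dense_mask = block_mask.repeat_interleave(blocksize, 0).repeat_interleave(blocksize, 1)

weight = torch.randn(out_features, in_features, dtype=torch.bfloat16, device="cuda")
weight = weight * dense_mask.cuda()

# Convert the pruned weight to blocked-sparse-row (BSR) layout so that
# F.linear can dispatch to the blocksparse addmm kernels.
weight_bsr = weight.to_sparse_bsr(blocksize=(blocksize, blocksize))

x = torch.randn(8, in_features, dtype=torch.bfloat16, device="cuda")
y = torch.nn.functional.linear(x, weight_bsr)
```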
1) Adds padding support to the triton kernel for dense matrices with dimension < 16, like those we run into during decoding; a sketch of the idea follows this list. (214 -> 218 tok/s)
2) Changes the default [num_stages](triton-lang/triton#512) parameter from 1 to 4. This has a large effect on performance, and the default kernel autotuning seemingly either does not touch this parameter or deems it unimportant. (218 -> 263 tok/s; see the config sketch below)
3) Adds an environment variable, BSR_AUTOTUNE, that users can set if they want kernel autotuning on top of the default parameters (usage shown below). (263 -> 266 tok/s) This seems to matter more for bs=n compute-bound workloads, where I see bs=8192 prefill drop from 0.3855s to 0.3745s (roughly 3%).
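The padding change in (1) boils down to the following idea. This is a rough sketch, not the kernel's actual code:

```
import torch
import torch.nn.functional as F

# Triton tiles want the dense dimension >= 16, but decoding produces
# activations with a single row, so zero-pad up to 16, matmul, then slice.
def linear_with_padding(x: torch.Tensor, weight_bsr: torch.Tensor) -> torch.Tensor:
    m = x.shape[0]
    if m < 16:
        x = F.pad(x, (0, 0, 0, 16 - m))  # pad rows to the minimum tile size
        return F.linear(x, weight_bsr)[:m]
    return F.linear(x, weight_bsr)
```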
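For (2), this is how num_stages enters a triton kernel configuration. The tile sizes here are made up; the real kernel's meta-parameters live in torchao:

```
import triton

# num_stages=4 lets the compiler software-pipeline global-memory loads
# across loop iterations, instead of issuing them serially (num_stages=1).
config = triton.Config(
    {"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 64},
    num_stages=4,
    num_warps=4,
)
```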
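For (3), setting the variable before the kernels are first used is all that's needed (script name is a placeholder):

```
import os

# Opt in to autotuning on top of the shipped defaults.
# Equivalent shell usage: BSR_AUTOTUNE=1 python your_benchmark.py
os.environ["BSR_AUTOTUNE"] = "1"
```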
So in total we are seeing a **1.985x** speedup (134 -> 266 tok/s) 🚀
I've also updated the documentation to no longer reference prototype; I plan to update the diagram in a subsequent PR.
### Testing
I added a new test case for the padded inputs and moved the test file out of prototype.
```
python test/sparsity/test_sparse_api.py
```