This repository contains the source code for the blog series "Optimizing Diffusion Inference for Production-Ready Speeds". The series walks through simple inference optimizations for FLUX.1 and Wan T2V.
We will cover the following topics:
- How text-to-image diffusion models work and their computational challenges
- Standard optimizations for transformer-based diffusion models
- Going deep: using faster kernels, non-trivial fusions, precomputations
- Context parallelism
- Quantization
- Caching
- LoRA
- Training
- Practice: Wan text-to-video
- Optimizing inference for uncommon deployment environments using Triton
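For orientation, an unoptimized FLUX.1 run with diffusers looks roughly like the sketch below. It is illustrative only: the prompt, resolution, and sampler settings are placeholders, and the scripts in this repository may set things up differently.

import torch
from diffusers import FluxPipeline

# Baseline: load FLUX.1-dev in bfloat16 and generate a single image.
# Assumes a GPU with enough memory; later posts cover quantization,
# caching, and other ways to speed this up and reduce the footprint.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt="A cat holding a sign that says hello world",
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=3.5,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("flux_baseline.png")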
git clone https://github.com/a-r-r-o-w/productionizing-diffusion
cd productionizing-diffusion/
uv venv venv
source venv/bin/activate
uv pip install torch==2.6 torchvision --index-url https://download.pytorch.org/whl/cu124 --verbose
uv pip install -r requirements.txt
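# Optional: quick sanity check of the environment (illustrative snippet,
# not part of this repo's scripts)
import torch

print("torch:", torch.__version__)
print("cuda runtime:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))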
# Make sure to have CUDA 12.4 or 12.8 (these are the only versions I've tested, so you
# might have to do things differently for other versions when setting up FA2)
# https://developer.nvidia.com/cuda-12-4-0-download-archive
# Flash Attention 2 (optional; FA3 is recommended and much faster on H100, while PyTorch's cuDNN backend is
# good on both A100 and H100)
# For Python 3.10, use the pre-built wheel below or build from source
MAX_JOBS=4 uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-build-isolation --verbose
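# Optional: smoke test for the FA2 install (illustrative; needs a CUDA GPU
# and fp16/bf16 inputs)
import torch
from flash_attn import flash_attn_func

# flash_attn_func expects (batch, seqlen, num_heads, head_dim) tensors
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = flash_attn_func(q, k, v, causal=False)
print(out.shape)  # torch.Size([2, 1024, 8, 64])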
# Flash Attention 3
# Make sure you have at least 64 GB of CPU RAM when building from source, otherwise
# the installation will crash
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper
# We check out v2.7.4.post1 because the latest release (2.8.x) might cause
# some installation issues that are hard to debug
# Update: 2.8.3 seems to install without any problems on CUDA 12.8 and PyTorch 2.10 nightly.
git checkout v2.7.4.post1
python setup.py install
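# Optional: smoke test for the FA3 build (illustrative; FA3 targets Hopper
# GPUs such as H100, so this is expected to fail on older architectures)
import torch
import flash_attn_interface  # module name installed by the hopper/ build above

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)
# Depending on the FA3 version, flash_attn_func may return the output alone
# or an (output, lse) tuple, so we avoid unpacking here.
result = flash_attn_interface.flash_attn_func(q, k, v)
print(type(result))

Related works referenced by this project: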
@article{fang2024xdit,
title={xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism},
author={Fang, Jiarui and Pan, Jinzhe and Sun, Xibo and Li, Aoyu and Wang, Jiannan},
journal={arXiv preprint arXiv:2411.01738},
year={2024}
}
@misc{paraattention-2025,
author = {Cheng, Zeyi},
title = {ParaAttention: Context Parallel Attention for Diffusion Transformers},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/chengzeyi/ParaAttention}}
}
If you use this project, please cite it as:
@misc{avs2025optdiff,
author = {Aryan V S},
title = {Optimizing Diffusion Inference for Production-Ready Speeds},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/a-r-r-o-w/productionizing-diffusion}},
url = {https://a-r-r-o-w.github.io/blog/3_blossom/00001_productionizing_diffusion-1/index.html}
}