productionizing-diffusion

This repository contains the source code for the blog series "Optimizing Diffusion Inference for Production-Ready Speeds". The series walks through inference optimizations for FLUX.1 and Wan T2V, starting from simple techniques and building up; a minimal unoptimized FLUX.1 baseline is sketched after the table of posts below.

We will cover the following topics:

  1. How text-to-image diffusion models work and the computational challenges they pose
  2. Standard optimizations for transformer-based diffusion models
  3. Going deep: using faster kernels, non-trivial fusions, precomputations
  4. Context parallelism
  5. Quantization
  6. Caching
  7. LoRA
  8. Training
  9. Practice: Wan text-to-video
  10. Optimizing inference for uncommon deployment environments using Triton
Post                                                               Topics covered
Optimizing diffusion inference for production-ready speeds - I     1, 2
Optimizing diffusion inference for production-ready speeds - II    3, 4
Optimizing diffusion inference for production-ready speeds - III   5, 6
Optimizing diffusion inference for production-ready speeds - IV    7, 8, 9, 10
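
As a point of reference for the optimizations above, here is a minimal, unoptimized FLUX.1 text-to-image baseline (a sketch using diffusers, assuming it is installed via requirements.txt; the model ID, resolution, step count, and guidance scale are illustrative defaults, not necessarily the exact settings used in the posts):

import torch
from diffusers import FluxPipeline

# Load FLUX.1-dev in bfloat16 (assumes a GPU with enough memory; otherwise
# call pipe.enable_model_cpu_offload() instead of pipe.to("cuda"))
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A photo of a cat wearing a tiny wizard hat"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=3.5,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("baseline.png")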

installation

git clone https://github.com/a-r-r-o-w/productionizing-diffusion
cd productionizing-diffusion/

uv venv venv
source venv/bin/activate

uv pip install torch==2.6 torchvision --index-url https://download.pytorch.org/whl/cu124 --verbose
uv pip install -r requirements.txt

# Make sure you have CUDA 12.4 or 12.8 (these are the only versions I've tested, so you
# might have to adjust the steps for other versions when setting up FA2)
# https://developer.nvidia.com/cuda-12-4-0-download-archive

# Flash Attention 2 (optional; FA3 is recommended and much faster on H100, while PyTorch's
# cuDNN backend is good on both A100 and H100)
# For Python 3.10, use the pre-built wheel below or build from source
MAX_JOBS=4 uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-build-isolation --verbose

# Flash Attention 3
# Make sure you have at least 64 GB of CPU RAM when building from source, otherwise
# the installation will crash
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper
# We check out v2.7.4.post1 because the latest release (2.8.x) might cause
# installation issues that are hard to debug
# Update: 2.8.3 seems to install without any problems on CUDA 12.8 and a PyTorch 2.10 nightly.
git checkout v2.7.4.post1
python setup.py install
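
After installing, a quick sanity check can confirm which attention backends are usable. This is a sketch assuming PyTorch 2.6 and a recent NVIDIA GPU; the flash_attn_interface module name for the FA3 hopper build is an assumption based on the upstream repository:

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Flash Attention 2 (optional wheel installed above)
try:
    import flash_attn
    print("flash-attn (FA2):", flash_attn.__version__)
except ImportError:
    print("flash-attn (FA2) not installed")

# Flash Attention 3 (hopper build; module name assumed from the upstream repo)
try:
    import flash_attn_interface  # noqa: F401
    print("flash-attn (FA3) available")
except ImportError:
    print("flash-attn (FA3) not installed")

# PyTorch's cuDNN attention backend on dummy tensors (shape: batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, q, q)
print("cuDNN SDPA output shape:", tuple(out.shape))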

references

@article{fang2024xdit,
  title={xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism},
  author={Fang, Jiarui and Pan, Jinzhe and Sun, Xibo and Li, Aoyu and Wang, Jiannan},
  journal={arXiv preprint arXiv:2411.01738},
  year={2024}
}
@misc{paraattention-2025,
  author = {Cheng, Zeyi},
  title = {ParaAttention: Context Parallel Attention for Diffusion Transformers},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/chengzeyi/ParaAttention}}
}

citation

If you use this project, please cite it as:

@misc{avs2025optdiff,
  author = {Aryan V S},
  title = {Optimizing Diffusion Inference for Production-Ready Speeds},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/a-r-r-o-w/productionizing-diffusion}},
  url = {https://a-r-r-o-w.github.io/blog/3_blossom/00001_productionizing_diffusion-1/index.html}
}
