
create a script to train autoencoderkl #10605

Merged: 15 commits into huggingface:main on Jan 27, 2025

Conversation

lavinal712 (Contributor)

Add AutoencoderKL Training Script

Description

This PR adds a complete training script for AutoencoderKL models. The script supports the following features:

  • Multiple loss functions (L1/L2 reconstruction loss, LPIPS perceptual loss, KL divergence)
  • Integrated adversarial training
  • Mixed precision training support
  • Comprehensive training logging and validation
  • Hugging Face Hub model upload
  • Detailed command-line parameter configuration

Key Features

Core Training Pipeline

  • Complete VAE training loop implementation
  • Multi-GPU distributed training support
  • Gradient accumulation and checkpoint saving

Loss Functions

  • Reconstruction loss (L1/L2)
  • LPIPS perceptual loss
  • KL divergence regularization
  • Adversarial loss (see the sketch below for how these terms are typically combined)
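
For orientation, here is a minimal sketch of how these terms are typically combined into a single VAE-GAN objective. Names such as lpips_fn, discriminator, and the weighting factors are illustrative assumptions, not necessarily the exact variables used in the script.

```python
import torch
import torch.nn.functional as F

def vae_loss(targets, reconstructions, posterior, lpips_fn, discriminator,
             kl_weight=1e-6, perceptual_weight=1.0, disc_weight=0.5):
    # Pixel-space reconstruction loss (L1 shown here; L2/MSE is the other option).
    rec_loss = F.l1_loss(reconstructions, targets)
    # LPIPS perceptual loss between inputs and reconstructions.
    p_loss = lpips_fn(reconstructions, targets).mean()
    # KL divergence of the encoder posterior against a standard normal prior.
    kl_loss = posterior.kl().mean()
    # Adversarial term: encourage reconstructions that fool the discriminator.
    g_loss = -torch.mean(discriminator(reconstructions))
    return rec_loss + perceptual_weight * p_loss + kl_weight * kl_loss + disc_weight * g_loss
```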

Optimization & Performance

  • 8-bit Adam optimizer support
  • Integrated xFormers memory optimization
  • TF32 acceleration support

Monitoring & Validation

  • TensorBoard and WandB logging
  • Periodic validation with sample image saving
  • Detailed training metrics monitoring

@sayakpaul

@sayakpaul (Member)

Hello, thanks so much for your contributions. Could you perhaps provide some decent results you obtained with the training script? Could you also help explain the main differences between this and the training script we have for vqgan?

@sayakpaul (Member)

@ariG23498 you might be interested in following this PR as you were looking for this for a while.

@lavinal712 (Contributor, Author)

This code is inspired by https://github.com/CompVis/latent-diffusion, https://github.com/Stability-AI/generative-models, #894, and #3801, and aims to provide a streamlined way to fine-tune or train VAEs with diffusers using minimal code. The key distinction between this implementation and the VQGAN script is the incorporation of KL loss for VAE training, along with support for training the decoder independently.
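
To illustrate the decoder-only option, a rough sketch assuming the standard AutoencoderKL module layout (the script's actual flag for enabling this may be named differently):

```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.requires_grad_(True)
# Freeze the encoder side so the latent space stays fixed and only the decoder is trained.
vae.encoder.requires_grad_(False)
vae.quant_conv.requires_grad_(False)

trainable_params = [p for p in vae.parameters() if p.requires_grad]
```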

@lavinal712 (Contributor, Author)

[attached: five sample result images from training]

@lavinal712 (Contributor, Author)

The above results were obtained using this script to train an SD-VAE from scratch on ImageNet, with only 1,000 steps of training completed.

@lavinal712 (Contributor, Author)

I apologize that I am currently unable to provide more thoroughly trained results. Experiments have shown that the training speed of this script is relatively slow, and the trained VAE model can only offer basic image reconstruction at this stage. Further experiments are yet to be conducted.

@sayakpaul (Member)

Thanks for this, and the results aren't underwhelming at all! Perhaps we could place the project under "research_projects" for now and expedite the merging? And then once you have more time to conduct experiments, we could bring it back to examples/?

WDYT?

@sayakpaul (Member) left a review:

Thanks, this is already very high-quality. I left some comments. LMK if they make sense.

```
accelerate config
```

## Training on ImageNet
Member:

Let's provide a smaller dataset here in the example.

## Training on ImageNet

```bash
accelerate launch --multi_gpu --num_processes 4 --mixed_precision bf16 train_autoencoderkl.py \
```
Member:

Let's keep it for a single GPU and then make a note about multi-GPU later.

```bash
--report_to wandb \
--mixed_precision bf16 \
--train_data_dir /path/to/ImageNet/train \
--validation_image ./image.png \
```
Member:

Where does it come from?

Contributor (Author):

The validation images are eight images randomly selected from the ImageNet validation set. Here they are simply represented by the placeholder ./image.png for illustration.

```python
import wandb

# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
check_min_version("0.30.0.dev0")
```
Member:

Suggested change:

```diff
- check_min_version("0.30.0.dev0")
+ check_min_version("0.33.0.dev0")
```

```
@@ -0,0 +1,1042 @@
import argparse
```
Member:

Let's add licensing.

```python
logger = get_logger(__name__)


def image_grid(imgs, rows, cols):
```
Member:

Can we make use of make_image_grid from diffusers.utils instead?
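
For reference, a small sketch of how the local helper could be swapped for the library utility (assuming the images are PIL images, which is what make_image_grid expects; the file names are hypothetical):

```python
from diffusers.utils import make_image_grid
from PIL import Image

images = [Image.open(p) for p in ["orig_0.png", "recon_0.png", "orig_1.png", "recon_1.png"]]
grid = make_image_grid(images, rows=2, cols=2)
grid.save("grid.png")
```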

Contributor (Author):

I will check it.

```python
tracker.log(
    {
        "Original (left), Reconstruction (right)": [
            wandb.Image(torchvision.utils.make_grid(image))
```
Member:

If we're using torchvision.utils.make_grid(), do we still need our own grid utility?

Contributor (Author):

This code comes from #3801. Next, I will review the code to ensure that modules and functions are fully utilized.

```python
).repo_id

# Load AutoencoderKL
if args.pretrained_model_name_or_path is None and args.model_config_name_or_path is None:
```
Member:

pretrained_model_name_or_path should be enough to do the init, no?

Contributor (Author):

pretrained_model_name_or_path allows people to fine-tune an existing VAE, while model_config_name_or_path enables users to configure the parameters of the VAE through a config.json file for training from scratch.
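
In other words, the two arguments select between two construction paths; a rough sketch (the script's actual branching may differ):

```python
from diffusers import AutoencoderKL

if args.pretrained_model_name_or_path is not None:
    # Fine-tune an existing VAE checkpoint.
    vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path)
else:
    # Train from scratch with hyperparameters taken from a config.json.
    config = AutoencoderKL.load_config(args.model_config_name_or_path)
    vae = AutoencoderKL.from_config(config)
```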

Member:

If both are passed, we should error out, no? If we're not already doing that, let's maybe add a check in parse_args()?
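
Something along these lines inside parse_args() would cover it (a sketch, not the final wording):

```python
if args.pretrained_model_name_or_path is not None and args.model_config_name_or_path is not None:
    raise ValueError(
        "Pass only one of --pretrained_model_name_or_path or --model_config_name_or_path, not both."
    )
```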

Comment on lines 877 to 886
```python
targets = batch["pixel_values"].to(dtype=weight_dtype)
if accelerator.num_processes > 1:
    posterior = vae.module.encode(targets).latent_dist
else:
    posterior = vae.encode(targets).latent_dist
latents = posterior.sample()
if accelerator.num_processes > 1:
    reconstructions = vae.module.decode(latents).sample
else:
    reconstructions = vae.decode(latents).sample
```
Member:

Prefer using accelerator.unwrap_model(vae) after computing targets. In case of num_processes=1, that would just become a no-op.
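
Concretely, the suggestion amounts to something like this sketch:

```python
targets = batch["pixel_values"].to(dtype=weight_dtype)
# unwrap_model() returns the underlying module when wrapped for DDP and is a no-op otherwise.
unwrapped_vae = accelerator.unwrap_model(vae)
posterior = unwrapped_vae.encode(targets).latent_dist
latents = posterior.sample()
reconstructions = unwrapped_vae.decode(latents).sample
```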

```python
else:
    reconstructions = vae.decode(latents).sample

if (step // args.gradient_accumulation_steps) % 2 == 0 or global_step < args.disc_start:
```
Member:

accelerate should be able to take care of gradient accumulation.

Can we use that?

Contributor (Author):

I will check it.

Contributor (Author):

This section of code handles the optimizers for both the VAE and the discriminator. Gradient accumulation from accelerate has already been integrated.
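
For readers following along, the alternating update being described looks roughly like this (vae_loss, disc_loss, and the optimizer names are illustrative, not the script's exact variables):

```python
with accelerator.accumulate(vae):
    train_vae = (step // args.gradient_accumulation_steps) % 2 == 0 or global_step < args.disc_start
    if train_vae:
        # VAE step: reconstruction + LPIPS + KL + adversarial terms.
        accelerator.backward(vae_loss)
        vae_optimizer.step()
        vae_optimizer.zero_grad()
    else:
        # Discriminator step on real vs. reconstructed images.
        accelerator.backward(disc_loss)
        disc_optimizer.step()
        disc_optimizer.zero_grad()
```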

```bash
--pretrained_model_name_or_path stabilityai/sd-vae-ft-mse \
--dataset_name=cifar10 \
--image_column=img \
--validation_image images/bird.jpg images/car.jpg images/dog.jpg images/frog.jpg \
```
Member:

Users won't know about these images. Maybe add a note?

Comment on lines 60 to 61
```python
# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
# check_min_version("0.33.0.dev0")
```
Member:

We should always check this. In fact, let's remove diffusers from requirements.txt and ask users to install diffusers from source in the README.

Contributor (Author):

My bad, I forgot to remove the comment.

@sayakpaul (Member) left a review:

Just two comments and then we should be good to go.

@lavinal712 (Contributor, Author)

I'm not very familiar with the pull request process. Does this mean the code has been successfully submitted?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul (Member)

@lavinal712 can you fix the code quality by running make style && make quality?

@lavinal712 (Contributor, Author)

> @lavinal712 can you fix the code quality by running make style && make quality?

Now it is OK on my end.

@sayakpaul merged commit 4fa2459 into huggingface:main on Jan 27, 2025
9 checks passed
@sayakpaul (Member)

Thanks for your contributions!

@priyammaz commented on Feb 19, 2025

So I have been working on building an LDM from scratch (for teaching purposes) and used this code as a reference, so thank you! I did have a question, though: I notice that you don't have anything updating your discriminator. You compute the disc_loss like the following:

```python
disc_loss = disc_factor * disc_loss(logits_real, logits_fake)
```

but as far as I can tell, no gradients are computed with this loss?
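
For comparison, a typical discriminator update in this kind of VAE-GAN setup would also run a backward pass and an optimizer step on that loss (disc_optimizer and disc_loss_fn are illustrative names, not taken from either script):

```python
logits_real = discriminator(targets)
logits_fake = discriminator(reconstructions.detach())  # detach so the VAE receives no gradient here
d_loss = disc_factor * disc_loss_fn(logits_real, logits_fake)
d_loss.backward()
disc_optimizer.step()
disc_optimizer.zero_grad()
```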

I reference here my own VAE trainer, which is giving good results so far on ImageNet, Conceptual Captions, and CelebA-HQ. They are still training and will take a week or so to finish, but it's a start!

[attached: sample reconstructions at iterations 18750, 25000, and 30750]

In all these cases, the PatchGAN discriminator has not kicked in yet; that'll happen later today or tomorrow once training reaches that iteration.

The autoencoder being trained is very close to AutoencoderKL, just rewritten for my teaching purposes.

I also have a training script for a VQ-VAE, if that would be interesting to add as well.
