
StableFaceEmotion


Left → Right: Original AffectNet photo  •  Baseline Realistic-Vision output  •  StableFaceEmotion finetuned output

Description: Happy, Close-up woman's face, smiling, long dark hair, fair skin, youthful appearance, light-colored eyes, well-defined eyebrows, straight nose, full lips, makeup, softened lighting, neutral background, facial features focused

Fine-tune Stable Diffusion 1.5 to generate photorealistic faces with controllable emotions.
The project extends the checkpoint SG161222/Realistic_Vision_V6.0_B1_noVAE with

  • a lightweight multi-modal guidance stack (depth, landmarks, FLAME render),
  • a composite loss (L₁ + LPIPS + EmoNet),
  • and large-scale balanced AffectNet supervision.

1 · Quick start

1.1 Single-GPU training

# Clone
git clone https://github.com/ValerianFourel/StableFace.git
cd StableFace

# Install (Python ≥ 3.10)
pip install -r requirements.txt
pip install --extra-index-url https://download.pytorch.org/whl/cu118 \
            torch==2.2.0+cu118 torchvision==0.17.0+cu118

# Launch (single GPU)
accelerate launch train_lpips_emonet_text_to_image.py

1.2 Multi-GPU training

accelerate launch --multi_gpu train_lpips_emonet_text_to_image.py

The default accelerate config uses DistributedDataParallel with gradient accumulation to reach an effective batch size of ≈ 1024 (per-GPU batch × number of GPUs × accumulation steps).
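Inside the training script this presumably amounts to something like the following (a sketch of the Hugging Face accelerate API; the accumulation-step count is an assumption derived from the numbers above):

```python
from accelerate import Accelerator

# 4 (per-GPU batch, L1 + LPIPS preset) x 2 GPUs x 128 steps = 1024 effective
accelerator = Accelerator(gradient_accumulation_steps=128, mixed_precision="fp16")
```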


2 · Hardware & software requirements

| Component | Minimum | Notes |
|-----------|---------|-------|
| GPU | NVIDIA A100-SXM4-80 GB | 2 × 80 GB tested |
| GPU | NVIDIA H100 80 GB HBM3 | alternative |
| CUDA | 11.8 | required |
| VRAM | 160 GB total | LPIPS feature maps |

HTCondor snippets:

  • A100

    condor_submit_bid 1000 -i \
        -append request_memory=281920 \
        -append request_cpus=10 \
        -append request_disk=100G \
        -append request_gpus=2 \
        -append 'requirements = CUDADeviceName == "NVIDIA A100-SXM4-80GB"'
  • H100

    condor_submit_bid 1000 -i \
        -append request_memory=281920 \
        -append request_cpus=10 \
        -append request_disk=100G \
        -append request_gpus=2 \
        -append 'requirements = CUDADeviceName == "NVIDIA H100 80GB HBM3"'

Per-GPU batch presets

  • L₁ + LPIPS → 4
  • L₁ only → 8

3 · Dataset layout

EmocaProcessed_38k/
├─ geometry_detail/      # FLAME renders
└─ inputs/               # cropped faces

38 000 AffectNet images, balanced across the 8 AffectNet emotion categories.
Upcoming (≥ 2024-10-13): renders pasted on original canvas for depth/semantic/skeleton supervision.
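For example, pairing each cropped face with its FLAME render reduces to matching filenames across the two folders (a sketch; identical names and a .jpg extension are assumptions):

```python
from pathlib import Path

root = Path("EmocaProcessed_38k")
pairs = [
    (img, root / "geometry_detail" / img.name)  # assumes identical filenames
    for img in sorted((root / "inputs").glob("*.jpg"))
]
```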

4 · Checkpoint packaging helper

# Copy the static pipeline components next to the trained UNet so the
# checkpoint becomes a complete, loadable diffusers pipeline
cp -r feature_extractor model_index.json safety_checker \
      scheduler text_encoder tokenizer vae \
      ../AllGuidances_2-sd-model-finetuned-l192_lpips08-emonet08-snr08-lr56-1024pics_224res/checkpoint-176/

# Copy the fine-tuned UNet weights into place
cp -r unet/* \
      ../AllGuidances_2-sd-model-finetuned-l192_lpips08-emonet08-snr08-lr56-1024pics_224res/checkpoint-176/unet/
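Once assembled, the checkpoint folder should load like any diffusers pipeline (a quick sanity check, assuming the copies above completed):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "../AllGuidances_2-sd-model-finetuned-l192_lpips08-emonet08-snr08-lr56-1024pics_224res/checkpoint-176"
)
```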

5 · Loss function

| Term | Weight | Purpose |
|------|--------|---------|
| L₁ | 0.92 | Pixel fidelity |
| LPIPS | 0.08 | Perceptual realism |
| EmoNet valence | 0.03 | Affective intensity |
| EmoNet arousal | 0.03 | Affective intensity |
| EmoNet expression | 0.02 | Discrete class |
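As a rough sketch, the training objective is the weighted sum from the table above (the lpips package is real; the EmoNet output format shown here is an assumption):

```python
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="vgg")  # expects images scaled to [-1, 1]

def composite_loss(pred, target, emo_pred, emo_target):
    # emo_* come from a frozen EmoNet: valence/arousal regressions plus
    # expression logits -- the dict layout is illustrative only.
    l1 = F.l1_loss(pred, target)
    lp = lpips_fn(pred, target).mean()
    val = F.mse_loss(emo_pred["valence"], emo_target["valence"])
    aro = F.mse_loss(emo_pred["arousal"], emo_target["arousal"])
    expr = F.cross_entropy(emo_pred["expression"], emo_target["expression"])
    return 0.92 * l1 + 0.08 * lp + 0.03 * val + 0.03 * aro + 0.02 * expr
```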

6 · Guidance stack

  • Depth map
  • 2-D landmarks
  • FLAME mesh render

All three signals are encoded by a small transformer, which preserves identity and head pose while the expression changes.
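A minimal sketch of such an encoder (patching, sizes and layer counts are assumptions, not the repository's actual architecture):

```python
import torch
import torch.nn as nn

class GuidanceEncoder(nn.Module):
    """Illustrative only: fuses depth, landmark and FLAME-render patch
    tokens with a small transformer."""
    def __init__(self, d_model=320, n_layers=2):
        super().__init__()
        self.proj = nn.ModuleDict({
            name: nn.Linear(16 * 16 * 3, d_model)  # flattened 16x16 RGB patches
            for name in ("depth", "landmarks", "flame")
        })
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feats):
        # feats: {"depth": (B, N, 768), "landmarks": ..., "flame": ...}
        tokens = torch.cat([self.proj[k](v) for k, v in feats.items()], dim=1)
        return self.encoder(tokens)  # conditioning sequence for the UNet
```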


7 · Inference (triptych comparison)

The repository ships with a ready-to-use script that

  1. loads both the original Realistic-Vision checkpoint and your fine-tuned StableFaceEmotion weights,
  2. synthesises an image for every prompt found in a JSON validation file,
  3. builds a “triptych” (original photo ➜ base SD image ➜ fine-tuned SD image) and saves it to disk.

Command

If you only want to generate triptychs, use the lightweight inference wrapper:

python inference.py \
    --config ./configs/inference/flame_emonet_validation.yaml

The YAML config exposes:

  • pretrained_model_name_or_path – base checkpoint (e.g. SG161222/Realistic_Vision_V6.0_B1_noVAE)
  • finetuned_model – path or HF repo of the StableFaceEmotion weights
  • negative_prompt / negative_prompt2 – long-form negative prompts already embedded in the script
  • seed – set to reproduce identical outputs
  • validation_dict – JSON mapping image-path ➜ prompt (used to build triptychs)
  • output_folder – where triptychs will be written
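Putting those keys together, a config might look like this (values are illustrative, not the shipped defaults):

```yaml
pretrained_model_name_or_path: SG161222/Realistic_Vision_V6.0_B1_noVAE
finetuned_model: ValerianFourel/RealisticEmotionStableDiffusion
negative_prompt: "deformed, blurry, low quality"  # the script ships longer ones
seed: 42
validation_dict: ./Modified_Corpus_300_validation.json  # image-path -> prompt
output_folder: ./outputs/triptychs
```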

The pipeline will automatically:

  • download / load the tokenizer, text-encoder, UNet, VAE and guidance encoders,
  • disable the safety-checker on the base model (to ensure a fair comparison),
  • run DDPM sampling (num_inference_steps = 300, guidance_scale = 9.0 by default),
  • place the three 512 × 512 images side-by-side with the cleaned prompt as caption.

Once finished, all files live under output_folder/subfolder/filename.png, mirroring the dataset hierarchy.
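Under the hood, loading and sampling with diffusers roughly follows this pattern (a sketch, not the shipped script; the noVAE base checkpoint may additionally need an external VAE):

```python
import torch
from diffusers import StableDiffusionPipeline

base = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V6.0_B1_noVAE",
    torch_dtype=torch.float16,
    safety_checker=None,  # disabled for a fair comparison
).to("cuda")
finetuned = StableDiffusionPipeline.from_pretrained(
    "ValerianFourel/RealisticEmotionStableDiffusion",  # your fine-tuned weights
    torch_dtype=torch.float16,
).to("cuda")

prompt = "Happy, close-up woman's face, smiling, long dark hair"

def sample(pipe):
    # fixed seed per pipeline so the two outputs are comparable
    return pipe(prompt, num_inference_steps=300, guidance_scale=9.0,
                generator=torch.Generator("cuda").manual_seed(42)).images[0]

img_base, img_ft = sample(base), sample(finetuned)
```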

8 · Validation / evaluation

Run quantitative evaluation on a 300-image validation split:

python validation_finetuning_Emotions.py \
    --cfg-path train_configs/minigptv2_finetune_gpt4vision_Full.yaml \
    --image-dir /fast/vfourel/FaceGPT/Data/StableFaceData/AffectNet41k_FlameRender_Descriptions_Images/affectnet_41k_AffectOnly/Manually_Annotated/Manually_Annotated_Images \
    --ground-truth /fast/vfourel/FaceGPT/Data/StableFaceData/AffectNet41k_FlameRender_Descriptions_Images/affectnet_41k_AffectOnly/EmocaProcessed_38k/Modified_Corpus_300_validation.json

The script reports FID, DISTS and EmoNet Top-k accuracy.
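For a quick FID check outside the script, torchmetrics provides a drop-in implementation (a sketch, not the repository's evaluation code; a `loader` yielding matched real/generated batches is assumed):

```python
# pip install torchmetrics[image]
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048, normalize=True)  # floats in [0, 1]
for real, fake in loader:  # (B, 3, H, W) image tensors
    fid.update(real, real=True)
    fid.update(fake, real=False)
print(f"FID: {fid.compute().item():.1f}")
```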

8.1 · Qualitative examples

Each triptych shows: left → original AffectNet photo, middle → baseline Realistic-Vision output, right → StableFaceEmotion output.


These visuals highlight sharper textures and noticeably improved emotion fidelity after fine-tuning.

9 · Results

| Metric | Base RV6.0-B1 | StableFaceEmotion | Δ |
|--------|---------------|-------------------|---|
| FID ↓ | 106.0 | 84.4 | −21.6 |
| DISTS ↓ | 0.329 | 0.320 | −2.6 % |
| EmoNet Top-1 ↑ | 31 % | 39 % | +8 pp |
| EmoNet Top-3 ↑ | 62 % | 72 % | +10 pp |

Largest gains: anger, disgust, surprise.

---

10 · Pre-trained weights & demo

  • Weights: https://huggingface.co/ValerianFourel/RealisticEmotionStableDiffusion
  • Live demo (HuggingFace Space): https://huggingface.co/spaces/ValerianFourel/StableFaceEmotion


11 · Citation

@misc{fourel2025stablefaceemotion,
  title  = {StableFaceEmotion: Structure-Aware Emotion Control for Stable Diffusion},
  author = {Valérian Fourel},
  year   = {2025},
  url    = {https://github.com/ValerianFourel/StableFace}
}

12 · License

Apache 2.0 for code.

Relevant links:

  • Weights: https://huggingface.co/ValerianFourel/RealisticEmotionStableDiffusion
  • HuggingFace Space: https://huggingface.co/spaces/ValerianFourel/StableFaceEmotion
  • AffectNet dataset: https://huggingface.co/datasets/chitradrishti/AffectNet
  • Original MiniGPT-v2 codebase: https://github.com/Vision-CAIR/MiniGPT-4
  • Medium article: https://medium.com/@valerian.fourel/stableface-a-stable-diffusion-model-for-faces-with-guidance-on-emotions-4ea9b5dfa29a
  • Descriptions generated with LLaVA-1.6: https://github.com/haotian-liu/LLaVA
  • EMOCA: https://github.com/radekd91/emoca
