Left → Right: Original AffectNet photo • Baseline Realistic-Vision output • StableFaceEmotion finetuned output
Description: Happy, Close-up woman's face, smiling, long dark hair, fair skin, youthful appearance, light-colored eyes, well-defined eyebrows, straight nose, full lips, makeup, softened lighting, neutral background, facial features focused
Fine-tune Stable Diffusion 1.5 to generate photorealistic faces with controllable emotions.
The project extends the checkpoint SG161222/Realistic_Vision_V6.0_B1_noVAE with
- a lightweight multi-modal guidance stack (depth, landmarks, FLAME render),
- a composite loss (L₁ + LPIPS + EmoNet),
- and large-scale balanced AffectNet supervision.
```bash
# Clone
git clone https://github.com/ValerianFourel/StableFace.git
cd StableFace

# Install (Python ≥ 3.10)
pip install -r requirements.txt
pip install --extra-index-url https://download.pytorch.org/whl/cu118 \
    torch==2.2.0+cu118 torchvision==0.17.0+cu118

# Launch (single GPU)
accelerate launch train_lpips_emonet_text_to_image.py

# Launch (multi-GPU)
accelerate launch --multi_gpu train_lpips_emonet_text_to_image.py
```

The default accelerate config uses DistributedDataParallel and gradient accumulation to reach an effective batch size of ≈ 1024.
| Component | Minimum | Notes |
|---|---|---|
| GPU | NVIDIA A100-SXM4-80 GB | 2 × 80 GB tested |
| GPU | NVIDIA H100 80 GB HBM3 | alternative |
| CUDA | 11.8 | required |
| VRAM | 160 GB total | LPIPS feature maps |
HTCondor snippets:

**A100**

```bash
condor_submit_bid 1000 -i \
  -append request_memory=281920 \
  -append request_cpus=10 \
  -append request_disk=100G \
  -append request_gpus=2 \
  -append 'requirements = CUDADeviceName == "NVIDIA A100-SXM4-80GB"'
```

**H100**

```bash
condor_submit_bid 1000 -i \
  -append request_memory=281920 \
  -append request_cpus=10 \
  -append request_disk=100G \
  -append request_gpus=2 \
  -append 'requirements = CUDADeviceName == "NVIDIA H100 80GB HBM3"'
```
Per-GPU batch presets:
- L₁ + LPIPS → 4
- L₁ only → 8
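The presets above combine with gradient accumulation to reach the effective batch size of ≈ 1024. A minimal sketch of the arithmetic (illustrative values, not taken from the repo's accelerate config):

```python
# Effective batch = num_gpus × per_gpu_batch × accumulation_steps.
# Solve for the accumulation steps needed to hit the ≈ 1024 target.
def grad_accum_steps(num_gpus, per_gpu_batch, target_effective=1024):
    assert target_effective % (num_gpus * per_gpu_batch) == 0
    return target_effective // (num_gpus * per_gpu_batch)

print(grad_accum_steps(2, 4))  # 128 steps for the L1 + LPIPS preset on 2 GPUs
print(grad_accum_steps(2, 8))  # 64 steps for the L1-only preset
```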
```
EmocaProcessed_38k/
├─ geometry_detail/   # FLAME renders
└─ inputs/            # cropped faces
```
38,000 AffectNet images, balanced across the dataset's 8 emotion categories.
Upcoming (≥ 2024-10-13): renders pasted onto the original canvas for depth/semantic/skeleton supervision.
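A minimal sketch of iterating the layout above, pairing each FLAME render with its cropped face. It assumes filenames match across the two subfolders; adjust if the real naming scheme differs:

```python
from pathlib import Path

def paired_samples(root):
    """Yield (face, render) path pairs from the EmocaProcessed_38k layout.

    Assumes geometry_detail/ and inputs/ share filenames; files without
    a counterpart are skipped.
    """
    root = Path(root)
    for render in sorted((root / "geometry_detail").iterdir()):
        face = root / "inputs" / render.name
        if face.exists():
            yield face, render
```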
Helpful tools
- Depth: isl-org/ZoeDepth
- Alignment: 1adrianb/face-alignment
```bash
cp -r feature_extractor model_index.json safety_checker \
      scheduler text_encoder tokenizer vae \
      ../AllGuidances_2-sd-model-finetuned-l192_lpips08-emonet08-snr08-lr56-1024pics_224res/checkpoint-176/
cp -r unet/* \
      ../AllGuidances_2-sd-model-finetuned-l192_lpips08-emonet08-snr08-lr56-1024pics_224res/checkpoint-176/unet/
```

| Term | Weight | Purpose |
|---|---|---|
| L₁ | 0.92 | Pixel fidelity |
| LPIPS | 0.08 | Perceptual realism |
| EmoNet Valence | 0.03 | Affective intensity |
| EmoNet Arousal | 0.03 | Affective intensity |
| EmoNet Expression | 0.02 | Discrete class |
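The table above can be read off directly as a weighted sum. A minimal sketch, assuming the five loss terms have already been computed as scalars (the names are illustrative, not the repo's variable names):

```python
# Weights from the composite-loss table above.
WEIGHTS = {
    "l1": 0.92,          # pixel fidelity
    "lpips": 0.08,       # perceptual realism
    "valence": 0.03,     # EmoNet affective intensity
    "arousal": 0.03,     # EmoNet affective intensity
    "expression": 0.02,  # EmoNet discrete class
}

def composite_loss(terms):
    """Weighted sum of precomputed loss terms (scalars or tensors)."""
    return sum(WEIGHTS[k] * terms[k] for k in WEIGHTS)
```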
- Depth map
- 2-D landmarks
- FLAME mesh render
All three signals are encoded by a small transformer, which preserves identity and head pose while the expression changes.
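A structural sketch of such a guidance encoder: each modality is projected into a shared token space and fused by a small transformer. All dimensions, patch sizes, and layer counts below are assumptions for illustration, not the repository's actual architecture:

```python
import torch
import torch.nn as nn

class GuidanceEncoder(nn.Module):
    """Fuse depth, landmark, and FLAME-render tokens into conditioning tokens."""

    def __init__(self, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.depth_proj = nn.Linear(16 * 16, dim)  # flattened depth patches
        self.lmk_proj = nn.Linear(2, dim)          # one (x, y) token per landmark
        self.flame_proj = nn.Linear(16 * 16, dim)  # flattened render patches
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, depth_patches, landmarks, flame_patches):
        tokens = torch.cat([
            self.depth_proj(depth_patches),
            self.lmk_proj(landmarks),
            self.flame_proj(flame_patches),
        ], dim=1)
        return self.encoder(tokens)  # (B, n_tokens, dim) conditioning tokens
```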
The repository ships with a ready-to-use script that
- loads both the original Realistic-Vision checkpoint and your fine-tuned StableFaceEmotion weights,
- synthesises an image for every prompt found in a JSON validation file,
- builds a “triptych” (original photo ➜ base SD image ➜ fine-tuned SD image) and saves it to disk.
If you only want to generate triptychs, use the lightweight inference wrapper:
```bash
python inference.py \
  --config ./configs/inference/flame_emonet_validation.yaml
```

The YAML config exposes:

- `pretrained_model_name_or_path` – base checkpoint (e.g. `SG161222/Realistic_Vision_V6.0_B1_noVAE`)
- `finetuned_model` – path or HF repo of the StableFaceEmotion weights
- `negative_prompt` / `negative_prompt2` – long-form negative prompts already embedded in the script
- `seed` – set to reproduce identical outputs
- `validation_dict` – JSON mapping image-path ➜ prompt (used to build triptychs)
- `output_folder` – where triptychs will be written
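An illustrative config layout using those keys — the key names follow the option list above, but check `configs/inference/flame_emonet_validation.yaml` for the real file and any extra fields:

```yaml
# Illustrative values only.
pretrained_model_name_or_path: SG161222/Realistic_Vision_V6.0_B1_noVAE
finetuned_model: ValerianFourel/RealisticEmotionStableDiffusion
seed: 42
validation_dict: ./validation_prompts.json   # assumed path: image-path -> prompt
output_folder: ./triptychs
```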
The pipeline will automatically:

- download / load the tokenizer, text encoder, UNet, VAE and guidance encoders,
- disable the safety checker on the base model (to ensure a fair comparison),
- run DDPM sampling (`num_inference_steps = 300`, `guidance_scale = 9.0` by default),
- place the three 512 × 512 images side by side with the cleaned prompt as caption.
Once finished, all files live under `output_folder/subfolder/filename.png`, mirroring the dataset hierarchy.
Run quantitative evaluation on a 300-image validation split:
```bash
python validation_finetuning_Emotions.py \
  --cfg-path train_configs/minigptv2_finetune_gpt4vision_Full.yaml \
  --image-dir /fast/vfourel/FaceGPT/Data/StableFaceData/AffectNet41k_FlameRender_Descriptions_Images/affectnet_41k_AffectOnly/Manually_Annotated/Manually_Annotated_Images \
  --ground-truth /fast/vfourel/FaceGPT/Data/StableFaceData/AffectNet41k_FlameRender_Descriptions_Images/affectnet_41k_AffectOnly/EmocaProcessed_38k/Modified_Corpus_300_validation.json
```

The script reports FID, DISTS and EmoNet Top-k accuracy.
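For reference, Top-k accuracy counts a sample as correct when the ground-truth class appears among the k highest-scoring classes. A minimal sketch of the metric (the validation script's exact implementation may differ):

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is in the top-k scored classes.

    scores: list of per-class score lists, one per sample.
    labels: list of ground-truth class indices.
    """
    hits = 0
    for row, label in zip(scores, labels):
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)
```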
Each triptych shows: left → original AffectNet photo, middle → baseline Realistic-Vision output, right → StableFaceEmotion output.
These visuals highlight sharper textures and noticeably improved emotion fidelity after fine-tuning.
| Metric | Base RV6.0-B1 | StableFaceEmotion | Δ |
|---|---|---|---|
| FID ↓ | 106.0 | 84.4 | −21.6 |
| DISTS ↓ | 0.329 | 0.320 | −0.009 |
| EmoNet Top-1 ↑ | 31 % | 39 % | +8 pp |
| EmoNet Top-3 ↑ | 62 % | 72 % | +10 pp |
Largest gains: anger, disgust, surprise.
---

- Weights: ValerianFourel/RealisticEmotionStableDiffusion
- Demo: HF Space
```bibtex
@misc{fourel2025stablefaceemotion,
  title  = {StableFaceEmotion: Structure-Aware Emotion Control for Stable Diffusion},
  author = {Valérian Fourel},
  year   = {2025},
  url    = {https://github.com/ValerianFourel/StableFace}
}
```

Apache 2.0 for code.
Weights: https://huggingface.co/ValerianFourel/RealisticEmotionStableDiffusion
HuggingFace Space: https://huggingface.co/spaces/ValerianFourel/StableFaceEmotion
AffectNet Dataset: https://huggingface.co/datasets/chitradrishti/AffectNet
Original MiniGPT-v2 codebase: https://github.com/Vision-CAIR/MiniGPT-4
Medium Article: https://medium.com/@valerian.fourel/stableface-a-stable-diffusion-model-for-faces-with-guidance-on-emotions-4ea9b5dfa29a
Image descriptions generated using LLaVA-1.6: https://github.com/haotian-liu/LLaVA