High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model
Mingtao Guo1† Guanyu Xing2‡ Yanli Liu1,3▽
1 National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China
2 School of Cyber Science and Engineering, Sichuan University, Chengdu, China
3 College of Computer Science, Sichuan University, Chengdu, China
† First author ‡ Second author ▽ Corresponding author
- [2025.03.02] Our pre-trained model is out on HuggingFace!
- [2025.02.27] ⭐ Exciting News! Relightable-Portrait-Animation got accepted by CVPR 2025!
We are going to make all of the following available:
- Model inference code
- Model checkpoint
- Training code
- Data processing code
- Clone this repo locally:
git clone https://github.com/MingtaoGuo/Relightable-Portrait-Animation
cd Relightable-Portrait-Animation
- Install the dependencies:
conda create -n relipa python=3.8
conda activate relipa
- Install packages for inference:
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install --extra-index-url https://miropsota.github.io/torch_packages_builder pytorch3d==0.7.7+pt2.2.2cu121
pip install -r requirements.txt
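After installing, a quick sanity check (a minimal sketch, not part of this repository; the file name sanity_check.py is ours) can confirm that PyTorch sees the GPU and that pytorch3d imports cleanly:

```python
# sanity_check.py -- optional, not part of this repo: verify the inference environment.
import torch
import pytorch3d

print("torch:", torch.__version__)          # expected: 2.2.2+cu121
print("pytorch3d:", pytorch3d.__version__)  # expected: 0.7.7
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```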
- Download the pretrained weights:
mkdir pretrained_weights
mkdir pretrained_weights/relipa
git-lfs install
git clone https://huggingface.co/MartinGuo/Relightable-Portrait-Animation
mv Relightable-Portrait-Animation/ref_embedder.pth pretrained_weights/relipa
mv Relightable-Portrait-Animation/light_embedder.pth pretrained_weights/relipa
mv Relightable-Portrait-Animation/head_embedder.pth pretrained_weights/relipa
mv Relightable-Portrait-Animation/unet.pth pretrained_weights/relipa
mv Relightable-Portrait-Animation/data src/decalib
mv Relightable-Portrait-Animation/u2net_human_seg.pth src/facematting
git clone https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt
mv stable-video-diffusion-img2vid-xt pretrained_weights
git clone https://huggingface.co/stabilityai/sd-vae-ft-mse
mv sd-vae-ft-mse pretrained_weights/stable-video-diffusion-img2vid-xt
The weights will be placed in the ./pretrained_weights directory. Heads up: the whole download can take quite a long time. Finally, the weights should be organized as follows:
./pretrained_weights/
|-- relipa
| |-- unet.pth
| |-- ref_embedder.pth
| |-- light_embedder.pth
| |-- head_embedder.pth
|-- stable-video-diffusion-img2vid-xt
| |-- sd-vae-ft-mse
| | |-- config.json
| | |-- diffusion_pytorch_model.bin
| |-- feature_extractor
| | |-- preprocessor_config.json
| |-- scheduler
| | |-- scheduler_config.json
| |-- model_index.json
| |-- unet
| | |-- config.json
| | |-- diffusion_pytorch_model.safetensors
| | |-- diffusion_pytorch_model.fp16.safetensors
| |-- image_encoder
| | |-- config.json
| | |-- model.safetensors
| | |-- model.fp16.safetensors
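Before running inference, a quick check such as the sketch below (based only on the tree above; the script name is ours, not part of the repo) can confirm the key files are in place:

```python
# check_weights.py -- optional sketch: verify the layout shown above.
import os

expected = [
    "relipa/unet.pth",
    "relipa/ref_embedder.pth",
    "relipa/light_embedder.pth",
    "relipa/head_embedder.pth",
    "stable-video-diffusion-img2vid-xt/model_index.json",
    "stable-video-diffusion-img2vid-xt/sd-vae-ft-mse/diffusion_pytorch_model.bin",
    "stable-video-diffusion-img2vid-xt/unet/diffusion_pytorch_model.safetensors",
    "stable-video-diffusion-img2vid-xt/image_encoder/model.safetensors",
]
missing = [p for p in expected
           if not os.path.exists(os.path.join("pretrained_weights", p))]
print("All expected weights found." if not missing else f"Missing files: {missing}")
```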
Here's the command to run the preprocessing script. It uses DECA to extract the pose from the driving video and the mesh from the reference portrait, then renders shading hints by combining the driving video's pose, the reference portrait's mesh, and the target lighting.
python preprocess.py --video_path resources/WDA_DebbieDingell1_000.mp4 --source_path resources/reference.png --light_path resources/target_lighting1.png --save_path resources/shading.mp4 --motion_align relative
After running preprocess.py, you'll get the following results:
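For intuition: the shading hints combine the mesh geometry (posed by the driving video) with the target lighting. One common way to render such hints is Lambertian shading from second-order spherical harmonics (the Ramamoorthi and Hanrahan irradiance approximation). The sketch below illustrates that idea only; the exact rendering in preprocess.py may differ.

```python
# sh_shading_sketch.py -- illustrative only; not the repo's exact rendering code.
# Lambertian irradiance from 9 spherical-harmonics lighting coefficients and
# per-pixel unit surface normals (H x W x 3).
import numpy as np

def sh_irradiance(normals, sh):
    # sh: (9,) coefficients ordered L00, L1-1, L10, L11, L2-2, L2-1, L20, L21, L22.
    c1, c2, c3, c4, c5 = 0.429043, 0.511664, 0.743125, 0.886227, 0.247708
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    return (c4 * sh[0]
            + 2 * c2 * (sh[1] * y + sh[2] * z + sh[3] * x)
            + 2 * c1 * (sh[4] * x * y + sh[5] * y * z + sh[7] * x * z)
            + c3 * sh[6] * z * z - c5 * sh[6]
            + c1 * sh[8] * (x * x - y * y))

# Toy example: camera-facing normals lit by an ambient term plus a frontal (z) band.
normals = np.zeros((4, 4, 3)); normals[..., 2] = 1.0
sh = np.array([0.5, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
shading = np.clip(sh_irradiance(normals, sh), 0.0, 1.0)  # grayscale shading hint
```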
Here's the command to run the inference script. It guides our model with the shading hints obtained from preprocessing to generate results in which the pose matches the driving video, the identity matches the reference image, and the lighting matches the target lighting.
python inference.py --pretrained_model_name_or_path pretrained_weights/stable-video-diffusion-img2vid-xt --checkpoint_path pretrained_weights/relipa/ --video_path resources/shading.mp4 --save_path result.mp4 --guidance 4.5 --inference_steps 25 --driving_mode relighting
After running inference.py, you'll get the following results:
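The --guidance flag sets the guidance scale. In Stable Video Diffusion-style pipelines this is classifier-free guidance: each denoising step blends an unconditional prediction with one conditioned on the shading hints and reference image. The sketch below shows the general rule, not the exact code in inference.py (which may, for example, ramp the scale across frames as the SVD pipeline does).

```python
# cfg_sketch.py -- illustrative only: how a guidance scale such as 4.5 is typically applied.
import torch

def guided_prediction(pred_uncond, pred_cond, guidance_scale=4.5):
    # Move the denoiser output away from the unconditional prediction and
    # toward the conditioned one; larger scales follow the conditions more strictly.
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Dummy tensors shaped like (batch, frames, latent channels, H, W).
pred_uncond = torch.randn(1, 16, 4, 64, 64)
pred_cond = torch.randn(1, 16, 4, 64, 64)
pred = guided_prediction(pred_uncond, pred_cond, guidance_scale=4.5)
```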
Here's the command to run the training script:
python train.py --pretrained_model_name_or_path pretrained_weights/stable-video-diffusion-img2vid-xt \
--height 512 --width 512 --num_frames 16 --validation_steps 100 --max_train_steps 30000 \
--gradient_accumulation_steps 8 --gradient_checkpointing True --learning_rate 1e-5 --use_8bit_adam True \
--sample_rate 4 --num_workers 2 --checkpointing_steps 1000 --checkpoints_total_limit 2 \
--data_meta_path TalkingHeadVideo/VFHQ/VFHQ-data-consistent.json
| VFHQ-video | VFHQ-kpmap | VFHQ-mesh | VFHQ-mask |
| --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() |
Training dataset
./TalkingHeadVideo/
|-- VFHQ
| |-- VFHQ-mask
| | |-- Clip+zZEv-ATOpoY+P0+C2+F3168-3532_10369.mp4
| | ...
| |-- VFHQ-kpmap
| |-- VFHQ-video
| |-- VFHQ-mesh
| |-- VFHQ-data-consistent.json
|-- CelebV-HQ
| |-- CelebV-HQ-mask
| | |-- __lRwnjxeCg_1.mp4
| | ...
| |-- CelebV-HQ-kpmap
| |-- CelebV-HQ-video
| |-- CelebV-HQ-mesh
| |-- CelebV-HQ-data-consistent.json
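Each clip is expected to appear under the same file name in the video, kpmap, mesh, and mask folders. The hypothetical sketch below shows one way to pair those quadruples by name; the actual dataloader in the training code may differ.

```python
# pair_vfhq_sketch.py -- hypothetical sketch; the repo's dataloader may differ.
import os

root = "TalkingHeadVideo/VFHQ"
samples = []
for name in sorted(os.listdir(os.path.join(root, "VFHQ-video"))):
    paths = {k: os.path.join(root, f"VFHQ-{k}", name)
             for k in ("video", "kpmap", "mesh", "mask")}
    if all(os.path.exists(p) for p in paths.values()):  # keep only complete quadruples
        samples.append(paths)
print(f"Found {len(samples)} complete VFHQ samples.")
```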
We first thank the contributors to the StableVideoDiffusion, Echomimic, and MimicMotion repositories for their open research and exploration. Furthermore, our repo incorporates some code from DECA, MediaPipe, and U2Net, and we extend our thanks to them as well.
If you use this model in your research, please consider citing:
@misc{guo2025highfidelityrelightablemonocularportrait,
title={High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model},
author={Mingtao Guo and Guanyu Xing and Yanli Liu},
year={2025},
eprint={2502.19894},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.19894},
}
This project is licensed under the MIT License. See the LICENSE file for details.