[2024.2.20] Our MMAD has been accepted by LREC-COLING 2024! Feel free to watch 👀 this repository for the latest updates.
MMAD exhibits remarkable audio description (AD) generation capabilities for movies by utilizing multi-modal inputs.
- Install the dependencies. If you have conda installed, you can run the following:
git clone https://github.com/Daria8976/MMAD.git
cd MMAD
bash environment.sh
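After the script finishes, a quick sanity check can confirm the environment is usable. This is a minimal sketch only; it assumes `environment.sh` installs PyTorch (which Video-LLaVA builds on) and that a CUDA GPU is available for inference.

```python
# Minimal environment sanity check (assumes environment.sh installed PyTorch,
# which Video-LLaVA builds on; adjust if your setup differs).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```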
- Download the pretrained model weights:
  - `checkpoint_step_50000.pth` under the `checkpoint` folder
  - `base.pth` under the `AudioEnhancing/configs` folder
  - `LanguageBind/Video-LLaVA-7B` under the `VideoCaption` folder
- Prepare `REPLICATE_API_TOKEN` in `llama.py` (see the sanity-check sketch after this list)
- Prepare demo data (we provide four demo videos here):
  - put `demo.mp4` under `Example/Video`
  - put character photos (named with the corresponding character name) under `Example/ActorCandidate`
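Before running inference, you can verify that the downloaded weights, demo data, and Replicate token are in place. The sketch below is illustrative and not part of the released code; the paths mirror the list above, and it assumes `llama.py` reads `REPLICATE_API_TOKEN` from the environment.

```python
# Sanity-check the expected file layout and API token before running infer.py.
# Paths mirror the setup list above; this script is illustrative, not part of MMAD.
import os
from pathlib import Path

required_paths = [
    Path("checkpoint/checkpoint_step_50000.pth"),      # pretrained checkpoint
    Path("AudioEnhancing/configs/base.pth"),           # audio enhancing weights
    Path("VideoCaption/LanguageBind/Video-LLaVA-7B"),  # Video-LLaVA weights
    Path("Example/Video/demo.mp4"),                    # demo video
    Path("Example/ActorCandidate"),                    # character photos folder
]

for path in required_paths:
    status = "ok" if path.exists() else "MISSING"
    print(f"[{status}] {path}")

# Assumption: llama.py expects the Replicate token in the environment.
if not os.environ.get("REPLICATE_API_TOKEN"):
    print("[MISSING] REPLICATE_API_TOKEN environment variable")
```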
python infer.py
Finally, we recruited 10 sighted volunteers and 10 blind and visually impaired (BVI) people (3 totally blind and 7 partially sighted) for human evaluation using a Likert scale, and merged the statistical results into the results table of the paper.
@inproceedings{ye2024mmad,
title={MMAD: Multi-modal Movie Audio Description},
author={Ye, Xiaojun and Chen, Junhao and Li, Xiang and Xin, Haidong and Li, Chao and Zhou, Sheng and Bu, Jiajun},
booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
pages={11415--11428},
year={2024}
}
Here are some great resources we benefit from or utilize:
- Video-LLaVA and Pengi for our code base.