
MMAD: Multi-modal Movie Audio Description

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

[📜Paper] [🗂️Project Page]

📰 News

[2024.2.20] Our MMAD has been accepted at COLING 2024! Feel free to watch 👀 this repository for the latest updates.

😮 Highlights

MMAD exhibits remarkable audio description (AD) generation capabilities for movies by utilizing multi-modal inputs.

🎥 Demo

The talented pianist, 1900, mesmerized the audience with his virtuosic performance of "Christmas Eve" while wearing a pristine white tuxedo and bow tie.

Chris Gardner, a man with a box in his hand, runs frantically through the city, dodging people and cars while being chased by a taxi driver who is honking.

Dancing in the rain, Don Lockwood twirls with joy, umbrella in hand, amidst city streets.

Alice fled through the mushroom forest, her heart racing as the Bandersnatch's ominous hisses and growls echoed behind her.

🛠️ Installation

  1. Install the dependencies. If you have conda installed, you can run the following:

git clone https://github.com/Daria8976/MMAD.git
cd MMAD
bash environment.sh

  2. Download the pretrained model weights:

  • checkpoint_step_50000.pth under the checkpoint folder
  • base.pth under the AudioEnhancing/configs folder
  • LanguageBind/Video-LLaVA-7B under the VideoCaption folder

  3. Prepare your REPLICATE_API_TOKEN in llama.py.

  4. Prepare the demo data (we provide four demo videos here):

  • put demo.mp4 under Example/Video
  • put character photos (each photo should be named with the corresponding character's name) under Example/ActorCandidate
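The setup above asks you to prepare a REPLICATE_API_TOKEN in llama.py. As an illustrative alternative (a hypothetical helper, not part of this repository), the token could instead be read from an environment variable so it stays out of the source tree:

```python
import os

def get_replicate_token():
    """Fetch the Replicate API token from the environment.

    Hypothetical sketch: the README asks you to put the token in
    llama.py; reading it from the REPLICATE_API_TOKEN environment
    variable is one way to avoid hard-coding a secret.
    """
    token = os.environ.get("REPLICATE_API_TOKEN")
    if not token:
        raise RuntimeError("REPLICATE_API_TOKEN is not set")
    return token
```

You would then export the token in your shell before running inference.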

💡 Inference

python infer.py
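Before running infer.py, it can help to confirm that the files from the setup steps are in place. A minimal pre-flight check (the paths below are assumptions taken from this README; adjust them to your checkout):

```python
import os

# Expected layout from the installation steps (assumed from this README).
REQUIRED_PATHS = [
    "checkpoint/checkpoint_step_50000.pth",
    "AudioEnhancing/configs/base.pth",
    "VideoCaption/Video-LLaVA-7B",
    "Example/Video/demo.mp4",
    "Example/ActorCandidate",
]

def missing_files(root="."):
    """Return the required paths that are not present under root."""
    return [p for p in REQUIRED_PATHS
            if not os.path.exists(os.path.join(root, p))]

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing before running infer.py:")
        for p in missing:
            print("  -", p)
    else:
        print("All required files found.")
```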

🚀 Main Results

For human evaluation, we recruited 10 sighted volunteers and 10 blind and visually impaired (BVI) participants (3 totally blind, 7 partially sighted) to rate outputs on a Likert scale, and we merged the statistical results into the results table of the paper.

📜 Cite

@inproceedings{ye2024mmad,
  title={MMAD: Multi-modal Movie Audio Description},
  author={Ye, Xiaojun and Chen, Junhao and Li, Xiang and Xin, Haidong and Li, Chao and Zhou, Sheng and Bu, Jiajun},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  pages={11415--11428},
  year={2024}
}

Acknowledgements

Here are some great resources we benefit from or utilize:

⭐️ Star History

Star History Chart