MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
Sony Group Corporation
arXiv
This software has only been tested on Ubuntu.
We recommend using a miniforge environment.
- Python 3.9+
- PyTorch 2.5.1+ and the corresponding torchvision/torchaudio (pick your CUDA version at https://pytorch.org/; pip install is recommended)
1. Install prerequisites if not yet met:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
(Or any other CUDA versions that your GPUs/driver support)
2. Clone our repository:
git clone https://github.com/sony/mmaudiosep.git MMAudioSep
3. Install with pip (install PyTorch before attempting this!):
cd MMAudioSep
pip install -e .
(If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip)
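After installation, you can run a quick sanity check. This is a minimal sketch: the `mmaudio` module name is assumed from the upstream MMAudio codebase this repository builds on, so adjust it if the package installs under a different name.
# Confirm PyTorch was installed with working CUDA support
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Confirm the package imports (module name is an assumption; adjust if needed)
python -c "import mmaudio"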
Pretrained models:
Note: Pretrained models are not yet available, but will be released in an upcoming update. See MODELS.md for more details.
By default, these scripts use the large_44k model.
In our experiments, inference requires approximately 10 GB of GPU memory (in 16-bit mode), which should fit on most modern GPUs.
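If you are unsure whether your GPU has enough memory, you can check its total memory with a generic PyTorch one-liner (not specific to this repository):
python -c "import torch; print(f'{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"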
With demo.py
python demo.py --duration=8 --video=<path to video> --audio=<path to mixture audio> --prompt "your prompt"
The output (audio in .flac format, and video in .mp4 format) will be saved in ./output.
See demo.py for more options.
Simply omit the --video option for text-query separation.
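For example, a text-query-only run looks like this (the path and prompt are placeholders):
python demo.py --duration=8 --audio=<path to mixture audio> --prompt "dog barking"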
The default output (and training) duration is 8 seconds. Longer/shorter durations could also work, but a large deviation from the training duration may result in lower quality.
With gradio_demo.py
Supports video-query and text-query sound separation.
Use port forwarding (e.g., ssh -L 7860:localhost:7860 server) if necessary. The default port is 7860, which you can specify with --port.
python gradio_demo.py
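For instance, to serve on a different port and access it from your local machine (the port number and server name below are placeholders):
# On the remote server
python gradio_demo.py --port 7861
# On your local machine, forward the port and open http://localhost:7861
ssh -L 7861:localhost:7861 server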
For training instructions, see TRAINING.md.
Note: Our evaluation code is based on av-benchmark (https://github.com/hkchengrex/av-benchmark).
It is already usable with the current setup, and we plan to release some modifications soon to better accommodate our specific use case.
For evaluation instructions, see EVAL.md.
MMAudioSep was trained on several datasets, including AudioSet, Freesound, VGGSound, AudioCaps, and WavCaps. These datasets are subject to specific licenses, which can be accessed on their respective websites. We do not guarantee that the pre-trained models are suitable for commercial use. Please use them at your own risk.
If you find this work useful for your research, please cite our paper as follows:
@article{takahashi2025mmaudiosep,
title={MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation},
author={Takahashi, Akira and Takahashi, Shusuke and Mitsufuji, Yuki},
journal={arXiv preprint arXiv:2510.09065},
url={https://arxiv.org/abs/2510.09065},
year={2025}
}
- MMAudio — the foundation of our main codebase; our implementation builds on MMAudio, with modifications and extensions tailored to our use case.
- av-benchmark — for benchmarking results.
We would like to express our gratitude to:
- Make-An-Audio 2 for the 16kHz BigVGAN pretrained model and the VAE architecture
- BigVGAN
- Synchformer
- EDM2 for the magnitude-preserving VAE network architecture