MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
Sony Group Corporation
arXiv
This software has only been tested on Ubuntu.
We recommend using a miniforge environment.
- Python 3.9+
- PyTorch 2.5.1+ and the corresponding torchvision/torchaudio (pick your CUDA version at https://pytorch.org/; pip install is recommended)
1. Install prerequisites if not yet met:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
(Or any other CUDA versions that your GPUs/driver support)
2. Clone our repository:
git clone https://github.com/sony/mmaudiosep.git MMAudioSep
3. Install with pip (install PyTorch before attempting this!):
cd MMAudioSep
pip install -e .
(If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip)
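After installation, you can run a quick sanity check. This is a minimal sketch: the `mmaudio` module name is assumed from the upstream MMAudio codebase this repository builds on, so adjust it if the package installs under a different name.
# Confirm PyTorch was installed with working CUDA support
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Confirm the package imports (module name is an assumption; adjust if needed)
python -c "import mmaudio"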
Pretrained models:
Note: Pretrained models are not yet available, but will be released in an upcoming update. See MODELS.md for more details.
By default, these scripts use the large_44k model.
In our experiments, inference requires approximately 10 GB of GPU memory (in 16-bit mode), which should fit on most modern GPUs.
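If you are unsure whether your GPU has enough memory, you can check its total memory with a generic PyTorch one-liner (not specific to this repository):
python -c "import torch; print(f'{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"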
With demo.py
python demo.py --duration=8 --video=<path to video> --audio=<path to mixture audio> --prompt "your prompt"
The output (audio in .flac format, and video in .mp4 format) will be saved in ./output.
See demo.py for more options.
Simply omit the --video option for text-query separation.
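For example, a text-query-only run looks like this (the path and prompt are placeholders):
python demo.py --duration=8 --audio=<path to mixture audio> --prompt "dog barking"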
The default output (and training) duration is 8 seconds. Longer/shorter durations could also work, but a large deviation from the training duration may result in lower quality.
With gradio_demo.py
Supports video-query and text-query sound separation.
Use port forwarding (e.g., ssh -L 7860:localhost:7860 server) if necessary. The default port is 7860, which you can specify with --port.
python gradio_demo.py
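For instance, to serve on a different port and access it from your local machine (the port number and server name below are placeholders):
# On the remote server
python gradio_demo.py --port 7861
# On your local machine, forward the port and open http://localhost:7861
ssh -L 7861:localhost:7861 server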
For training instructions, see TRAINING.md.
Note: Our evaluation code is based on av-benchmark (https://github.com/hkchengrex/av-benchmark).
It is already usable with the current setup, and we plan to release some modifications soon to better accommodate our specific use case.
For evaluation instructions, see EVAL.md.
MMAudioSep was trained on several datasets, including AudioSet, Freesound, VGGSound, AudioCaps, and WavCaps. These datasets are subject to specific licenses, which can be accessed on their respective websites. We do not guarantee that the pre-trained models are suitable for commercial use. Please use them at your own risk.
If you find this work useful for your research, please cite our paper as follows:
@article{takahashi2025mmaudiosep,
title={MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation},
author={Takahashi, Akira and Takahashi, Shusuke and Mitsufuji, Yuki},
journal={arXiv preprint arXiv:2510.09065},
url={https://arxiv.org/abs/2510.09065},
year={2025}
}
- MMAudio — the foundation of our main codebase; our implementation builds on MMAudio, with modifications and extensions tailored to our use case.
- av-benchmark — for benchmarking results.
We would like to express our gratitude to:
- Make-An-Audio 2 for the 16kHz BigVGAN pretrained model and the VAE architecture
- BigVGAN
- Synchformer
- EDM2 for the magnitude-preserving VAE network architecture