
MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

Official PyTorch Implementation

Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
Sony Group Corporation

arXiv: https://arxiv.org/abs/2510.09065

Installation

This software has only been tested on Ubuntu.

Prerequisites

We recommend using a miniforge environment.

  • Python 3.9+
  • PyTorch 2.5.1+ and the corresponding torchvision/torchaudio (choose the build for your CUDA version at https://pytorch.org/; pip install recommended)

1. Install the prerequisites if they are not yet met:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade

(Or any other CUDA version that your GPU/driver supports)
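
If you are unsure which CUDA version your driver supports, nvidia-smi prints it in the header of its output (assuming an NVIDIA GPU):

nvidia-smi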

2. Clone our repository:

git clone https://github.com/sony/mmaudiosep.git MMAudioSep

3. Install with pip (install PyTorch first before attempting this!):

cd MMAudioSep
pip install -e .

(If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip)
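
As an optional sanity check before running the demos, you can confirm that PyTorch was installed with working CUDA support:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

This should print your PyTorch version followed by True.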

Pretrained models

Note: Pretrained models are not yet available, but will be released in an upcoming update. See MODELS.md for more details.

Demo

By default, these scripts use the large_44k model. In our experiments, inference requires approximately 10GB of GPU memory (in 16-bit mode), which should be compatible with most modern GPUs.
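
To check whether your GPU has enough free memory, you can query it with nvidia-smi (assuming an NVIDIA GPU):

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv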

Command-line interface

With demo.py

python demo.py --duration=8 --video=<path to video> --audio=<path to mixture audio> --prompt "your prompt" 

The output (audio in .flac format, and video in .mp4 format) will be saved in ./output. See demo.py for more options. Simply omit the --video option for text-query separation. The default output (and training) duration is 8 seconds. Longer or shorter durations may also work, but a large deviation from the training duration may result in lower quality.
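
For example, text-query separation uses the same command without the --video option:

python demo.py --duration=8 --audio=<path to mixture audio> --prompt "your prompt"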

Gradio interface

Supports video-query and text-query sound separation. Use port forwarding (e.g., ssh -L 7860:localhost:7860 server) if necessary. The default port is 7860; you can change it with --port.

python gradio_demo.py
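
For example, to serve on a different port using the --port option mentioned above:

python gradio_demo.py --port 7861

Then, from your local machine, forward that port with ssh -L 7861:localhost:7861 server.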

Training

See TRAINING.md.

Evaluation

Note: Our evaluation code is based on av-benchmark (https://github.com/hkchengrex/av-benchmark).
It is already usable with the current setup, and we plan to release some modifications soon to better accommodate our specific use case.

See EVAL.md.

Training Datasets

MMAudioSep was trained on several datasets, including AudioSet, Freesound, VGGSound, AudioCaps, and WavCaps. These datasets are subject to specific licenses, which can be accessed on their respective websites. We do not guarantee that the pre-trained models are suitable for commercial use. Please use them at your own risk.

Citation

If you find this work useful for your research, please cite our paper as follows:

@article{takahashi2025mmaudiosep,
  title={MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation},
  author={Takahashi, Akira and Takahashi, Shusuke and Mitsufuji, Yuki},
  journal={arXiv preprint arXiv:2510.09065},
  url={https://arxiv.org/abs/2510.09065},
  year={2025}
}

Relevant Repositories

  • MMAudio — the foundation of our main codebase. Our implementation is based on MMAudio, with modifications and extensions tailored to our use case.
  • av-benchmark — for benchmarking results.

Acknowledgement

We would like to express our gratitude to:
