Audio Dataset Instruction

Stage1: Pretraining

We mainly use the WavCaps dataset for pretraining.

Download

# install git-lfs
sudo apt update
sudo apt-get install git-lfs


git clone https://huggingface.co/datasets/cvssp/WavCaps
cd WavCaps
git lfs pull --include "*"

Processing

Extract zip file

# merge shards first
zip -s- FILE_NAME.zip -O COMBINED_FILE.zip
unzip COMBINED_FILE.zip

Extract raw audio data

unzip COMBINED_FILE.zip -d /target/dir
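The extraction step can also be scripted. The following is a minimal sketch, assuming the shards have already been merged into a single archive with the `zip -s-` command above (Python's standard `zipfile` module does not read multi-part archives). The function name `extract_archive` is hypothetical and not part of this repository.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract a merged (single-file) zip into target_dir.

    Returns the number of regular files in the archive.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise ValueError(f"corrupt member in archive: {bad_member}")
        archive.extractall(target)
        return sum(1 for info in archive.infolist() if not info.is_dir())
```

Calling `extract_archive("COMBINED_FILE.zip", "/target/dir")` would mirror the `unzip` command above, with an integrity check before extraction.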

Create JSON annotation files for each example. Before processing, edit dataset/audio/process.py to set the data and JSON paths.

python3 dataset/audio/process.py --dataset test --data_dir /path/to/data --json_path /path/to/json
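To illustrate what "one JSON file per example" can look like, here is a small sketch. It is not the logic of process.py: the function name `write_annotations`, the input mapping from file stem to caption, and the record fields `audio` and `caption` are all assumptions made for the example.

```python
import json
from pathlib import Path


def write_annotations(captions, data_dir, json_dir):
    """Write one JSON file per audio file that has a caption.

    captions maps a file stem (name without extension) to its caption text.
    Returns the number of JSON files written.
    """
    out = Path(json_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for audio_path in sorted(Path(data_dir).iterdir()):
        caption = captions.get(audio_path.stem)
        if not audio_path.is_file() or caption is None:
            continue  # skip directories and audio without a caption
        # Hypothetical record schema; adapt it to what the training code expects.
        record = {"audio": audio_path.name, "caption": caption}
        with open(out / f"{audio_path.stem}.json", "w", encoding="utf-8") as f:
            json.dump(record, f, ensure_ascii=False)
        written += 1
    return written
```

Keeping the JSON file name identical to the audio file stem makes it easy to group each audio clip with its annotation in later steps.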

Pack with tar

python3 dataset/audio/make_tar.py --input /path/to/data --output /path/to/web_dataset \
    --dataclass none --filename filename --num_element 500
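The packing step splits the examples into tar shards of a fixed size. The sketch below shows the general idea only and is not the implementation of make_tar.py: it assumes each example's audio and JSON files sit in one directory and share a file stem, and the function name `make_shards` and the shard naming pattern are hypothetical.

```python
import tarfile
from collections import defaultdict
from pathlib import Path


def make_shards(input_dir, output_dir, filename="filename", num_element=500):
    """Pack examples into tar shards holding at most num_element examples each.

    Files sharing a stem (e.g. x.flac and x.json) are treated as one example
    and are always written to the same shard. Returns the shard paths.
    """
    samples = defaultdict(list)
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file():
            samples[path.stem].append(path)

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    keys = sorted(samples)
    shard_paths = []
    for shard_index, start in enumerate(range(0, len(keys), num_element)):
        shard_path = out / f"{filename}-{shard_index:06d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for key in keys[start:start + num_element]:
                for member in samples[key]:
                    # Store flat names so each example's files are adjacent.
                    tar.add(member, arcname=member.name)
        shard_paths.append(shard_path)
    return shard_paths
```

With `num_element=500`, as in the command above, each shard would hold up to 500 examples and the last shard would hold the remainder.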

To view tar file

tar tf filename.tar | sed 10q
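The same check can be done from Python, which is convenient inside a notebook or a test. This is a small sketch using the standard `tarfile` module; the function name `head_tar` is hypothetical.

```python
import itertools
import tarfile


def head_tar(tar_path, limit=10):
    """Return the names of the first `limit` members of a tar file."""
    with tarfile.open(tar_path, "r") as tar:
        # Iterating a TarFile yields members in archive order, so only the
        # first `limit` headers are read.
        return [member.name for member in itertools.islice(tar, limit)]
```

`head_tar("filename.tar")` lists the same entries as the `tar tf filename.tar | sed 10q` command above.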

To set up a dataset in one line:

# DATASET can be one of: soundbible, bbc, audioset, freesound
DATASET=soundbible bash dataset/audio/setup.sh

Stage2: Instruction Tuning

We use Clotho as the base corpus to construct our instruction tuning dataset Clotho-Detail.

Download

Go to the Clotho Dataset page and download the audio archives clotho_audio_development.7z and clotho_audio_evaluation.7z.
Download the generated annotation file Clotho-Detail.

Processing

Extract the archives above and merge all audio files into a single folder named audio. The folder should then contain 3,939 audio files.
Place the annotation file at the same directory level as the audio folder, like:

clotho
├─ Clotho-detail-annotation.json
├─ audio
├─── 00294 harvest festival rumour 1.wav
├─── 00332 lake beach 1.wav
├─── ...

Edit the path and name configuration in the corresponding files accordingly.
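Before training, it can help to verify the layout described above. The following is a minimal sketch, assuming all audio files use the .wav extension shown in the tree; the function name `check_clotho_layout` is hypothetical and not part of this repository.

```python
from pathlib import Path


def check_clotho_layout(root, expected_count=3939,
                        annotation_name="Clotho-detail-annotation.json"):
    """Return a list of problems found in the clotho folder (empty if none)."""
    root = Path(root)
    audio_dir = root / "audio"
    # Count only .wav files directly inside the audio folder.
    wav_count = sum(1 for _ in audio_dir.glob("*.wav")) if audio_dir.is_dir() else 0

    problems = []
    if not (root / annotation_name).is_file():
        problems.append(f"missing annotation file: {annotation_name}")
    if wav_count != expected_count:
        problems.append(f"expected {expected_count} .wav files, found {wav_count}")
    return problems
```

An empty list from `check_clotho_layout("clotho")` indicates that the annotation file is present and the audio count matches.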

Image-Audio Dataset Instruction

Part 1: Aligned Audio-Image Data

We use VGGSS as the base data to construct our training corpus for multi-modality instruction tuning.

To use this corpus:

Follow the VGGSS GitHub page and project page to prepare the audio and image data in the audio and image folders.
Download our refactored annotation file VGGSS-Instruction.
Place the annotation file at the same directory level as the audio and image folders, like:

VGGSS
├─ audio
├─── 007P6bFgRCU_000010.wav
├─── 00QQLLcny14_000083.wav
├─ image
├─── 007P6bFgRCU_000010.jpg
├─── 00QQLLcny14_000083.jpg
├─ vggss-instruction-tuning.json

Edit the path and name configuration in the corresponding files accordingly.
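Since the data is aligned, every audio clip should have an image with the same file stem, as in the tree above. A minimal sketch for checking this, assuming .wav audio and .jpg images as shown; the function name `unpaired_samples` is hypothetical.

```python
from pathlib import Path


def unpaired_samples(root):
    """Return (audio stems without an image, image stems without audio)."""
    root = Path(root)
    audio = {p.stem for p in (root / "audio").glob("*.wav")}
    image = {p.stem for p in (root / "image").glob("*.jpg")}
    return sorted(audio - image), sorted(image - audio)
```

Both returned lists are empty when every audio clip has a matching image and vice versa.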

Part 2: Unaligned Audio-Image Data

The unaligned audio-image data can be collected by pairing arbitrary image and audio data from different datasets. Please refer to the config of the negatively paired audio-image dataset and modify it accordingly.
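As an illustration of pairing arbitrary images and audio, the sketch below draws random pairs from two independent file lists. It is not the repository's dataset class: the function name `pair_unaligned` and its parameters are assumptions, and the actual pairing is defined by the config mentioned above.

```python
import random


def pair_unaligned(image_paths, audio_paths, num_pairs, seed=0):
    """Draw num_pairs random (image, audio) pairs from two unrelated lists."""
    if not image_paths or not audio_paths:
        raise ValueError("both image_paths and audio_paths must be non-empty")
    # A seeded generator keeps the pairing reproducible across runs.
    rng = random.Random(seed)
    return [(rng.choice(image_paths), rng.choice(audio_paths))
            for _ in range(num_pairs)]
```

Because the two lists come from different datasets, a sampled image and audio clip have no intended correspondence, which is what makes the pairs usable as negatives.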
