We mainly use the WavCaps dataset for pre-training.
# install git-lfs
sudo apt update
sudo apt-get install git-lfs
git clone https://huggingface.co/datasets/cvssp/WavCaps
cd WavCaps
git lfs pull --include "*"
- Extract zip file
# merge shards first
zip -s- FILE_NAME.zip -O COMBINED_FILE.zip
unzip COMBINED_FILE.zip
- Processing: extract the raw audio data
unzip COMBINED_FILE.zip -d /target/dir
- Create JSON annotation files for each example. Before processing, modify dataset/audio/process.py to set the data and JSON paths.
python3 dataset/audio/process.py --dataset test --data_dir /path/to/data --json_path /path/to/json
- Pack with tar
python3 dataset/audio/make_tar.py --input /path/to/data --output /path/to/web_dataset \
--dataclass none --filename filename --num_element 500
To view the first entries of a tar file:
tar tf filename.tar | sed 10q
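The packing and inspection steps above can be sketched in Python as follows. This is a minimal illustration, not the repository's dataset/audio/make_tar.py: the function names `pack_shards` and `head_members` and the shard naming scheme are assumptions made for this example. It groups files into tar shards of at most `num_element` entries and lists the first few members of a shard, similar to `tar tf filename.tar | sed 10q`.

```python
import tarfile
from pathlib import Path


def pack_shards(input_dir, output_dir, prefix="shard", num_element=500):
    """Pack files under input_dir into tar shards of at most num_element files.

    Hypothetical helper for illustration; the shard names
    (prefix-000000.tar, ...) are an assumption, not the repo's convention.
    """
    files = sorted(p for p in Path(input_dir).rglob("*") if p.is_file())
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    shards = []
    for start in range(0, len(files), num_element):
        shard_path = out / f"{prefix}-{start // num_element:06d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for f in files[start:start + num_element]:
                # Store paths relative to input_dir so shards are relocatable.
                tar.add(f, arcname=f.relative_to(input_dir).as_posix())
        shards.append(shard_path)
    return shards


def head_members(tar_path, n=10):
    """Return the names of the first n members of a tar archive."""
    names = []
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            names.append(member.name)
            if len(names) >= n:
                break
    return names
```

Note that this sketch splits purely by file count; if an audio file and its JSON annotation must stay in the same shard, choose `num_element` as a multiple of the number of files per example.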
To set everything up in one line:
# DATASET is one of: soundbible, bbc, audioset, freesound
DATASET=soundbible bash dataset/audio/setup.sh
We use Clotho as the base corpus to construct our instruction-tuning dataset, Clotho-Detail.
- Go to the Clotho dataset page and download the clotho_audio_development.7z and clotho_audio_evaluation.7z audio files.
- Download the generated annotation file Clotho-Detail.
- Unzip the files above and merge all audio files into a single folder named audio. The folder should then contain 3,939 audio files.
- Put the annotation file at the same directory level as the audio folder, like:
clotho
├─ Clotho-detail-annotation.json
├─ audio
├─── 00294 harvest festival rumour 1.wav
├─── 00332 lake beach 1.wav
├─── ...
- Edit the path and name configuration in the corresponding files accordingly.
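Before editing the configuration, it can help to verify the layout described above. The following is a minimal sketch, assuming the annotation file is named Clotho-detail-annotation.json as in the tree and that all audio files carry a .wav extension; `check_clotho_layout` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

EXPECTED_AUDIO_COUNT = 3939  # the count stated in the steps above


def check_clotho_layout(root,
                        annotation_name="Clotho-detail-annotation.json",
                        expected_count=EXPECTED_AUDIO_COUNT):
    """Return a list of problems found in the clotho folder (empty if none)."""
    root = Path(root)
    problems = []
    if not (root / annotation_name).is_file():
        problems.append(f"missing annotation file: {annotation_name}")
    audio_dir = root / "audio"
    if not audio_dir.is_dir():
        problems.append("missing audio folder")
    else:
        # Count only .wav files (assumption: every clip is a .wav file).
        found = sum(1 for p in audio_dir.iterdir() if p.suffix.lower() == ".wav")
        if found != expected_count:
            problems.append(f"expected {expected_count} .wav files, found {found}")
    return problems
```

Calling `check_clotho_layout("clotho")` returns an empty list when the annotation file is present and the audio folder holds the expected number of files.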
We use VGGSS as the base dataset to construct our training corpus for multi-modal instruction tuning.
To use this corpus:
- Follow the GitHub page and project page of VGGSS to prepare the audio and image data in the audio and image folders.
- Download our refactored annotation file VGGSS-Instruction.
- Put the annotation file at the same directory level as the audio and image folders, like:
VGGSS
├─ audio
├─── 007P6bFgRCU_000010.wav
├─── 00QQLLcny14_000083.wav
├─ image
├─── 007P6bFgRCU_000010.jpg
├─── 00QQLLcny14_000083.jpg
├─ vggss-instruction-tuning.json
- Edit the path and name configuration in the corresponding files accordingly.
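In the tree above, each audio clip shares its file stem with one image (for example 007P6bFgRCU_000010.wav and 007P6bFgRCU_000010.jpg). A minimal sketch for spotting files without a counterpart, assuming the .wav and .jpg extensions shown in the tree; `unmatched_stems` is a hypothetical helper, not part of the repository:

```python
from pathlib import Path


def unmatched_stems(root):
    """Return (audio_only, image_only): sorted stems lacking a counterpart."""
    root = Path(root)
    audio = {p.stem for p in (root / "audio").glob("*.wav")}
    image = {p.stem for p in (root / "image").glob("*.jpg")}
    return sorted(audio - image), sorted(image - audio)
```

Both returned lists are empty when every audio clip has a matching image and vice versa.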
The unaligned audio-image data can be collected by pairing arbitrary images and audio clips from different datasets. Please refer to the config of the negatively paired audio-image dataset and modify it accordingly.
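The pairing idea can be illustrated with a short sketch. It draws random audio and image paths independently, assuming the two lists come from different datasets so that any pair is unaligned. The function name `make_negative_pairs` and the output record format are assumptions for illustration; the actual format expected by the repository is defined by its config.

```python
import random


def make_negative_pairs(audio_paths, image_paths, num_pairs, seed=0):
    """Pair randomly chosen audio and image paths (hypothetical record format)."""
    rng = random.Random(seed)  # fixed seed so the pairing is reproducible
    return [
        {"audio": rng.choice(audio_paths), "image": rng.choice(image_paths)}
        for _ in range(num_pairs)
    ]
```

Sampling with replacement keeps the sketch simple; the same file may appear in more than one pair.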