# Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning (DSAML)
[Project Website] | [Paper]
This repository contains the core implementation of the DSAML model from the paper "Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning", accepted at AAAI 2025.
## Requirements

- Python >= 3.8.5, < 3.9
- PyTorch >= 2.2.1
Create the conda environment and install the Python dependencies:

```bash
conda env create -f environment.yml
pip install -r requirements.txt
```
## Dataset Preparation

Download the DEAM dataset and unzip both the audio and annotation archives. Specifically, create `DEAM_Annotations` and `DEAM_audio` folders in the dataset root folder, and put the annotation and audio files in the corresponding folders. The final file structure should look like this:
```
DEAM
├── DEAM_Annotations
│   ├── annotations
├── DEAM_audio
│   ├── MEMD_audio
└── features            (optional)
    └── features
```
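Optionally, you can sanity-check the layout before preprocessing. The snippet below is a minimal sketch (not part of the repository); the dataset root path is a placeholder you should replace with your own:

```python
# Minimal sketch: verify the expected DEAM folder layout before preprocessing.
# "/your/path/to/DEAM" is a placeholder; use your own dataset root.
from pathlib import Path

dataset_root = Path("/your/path/to/DEAM")
for sub in ("DEAM_Annotations/annotations", "DEAM_audio/MEMD_audio"):
    path = dataset_root / sub
    print(f"{path} -> {'found' if path.is_dir() else 'MISSING'}")
```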
Before preprocessing the dataset, create a `.env` file in the root directory of the project. The `.env` file should contain the following environment variables:
```bash
# The directory to save the logs
LOG_DIR="./logs"
# The directory to save the audio embeddings for the DEAM dataset
AUDIO_EMBEDDING_DIR_NAME="feature_embedding"
# The path to the DEAM dataset
DATASET_PATH="/your/path/to/DEAM"
# The key to the audio input in the dataset, please keep this
AUDIO_INPUT_KEY="log_mel_spectrogram"
```
Modify `DATASET_PATH` (and `PMEMO_DATASET_PATH`, if you use the PMEmo dataset) to the paths where you store the DEAM and PMEmo datasets.
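For reference, here is a minimal sketch of how these variables can be read in Python with `python-dotenv`; this assumes that is how the configuration is loaded, so adjust it to the project's actual loading code:

```python
# Minimal sketch: reading the .env configuration with python-dotenv.
# This assumes python-dotenv is installed; the project may load the
# configuration differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

dataset_path = os.getenv("DATASET_PATH")
audio_input_key = os.getenv("AUDIO_INPUT_KEY", "log_mel_spectrogram")
print(dataset_path, audio_input_key)
```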
To speed up training, the dataset needs to be preprocessed. Run the following command:
```bash
./scripts/dataset.sh
# If you want to use a specific GPU, you can run it like this instead:
# CUDA_VISIBLE_DEVICES=1 ./scripts/dataset.sh
```
This process will take about one hour, depending on your machine.
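The `AUDIO_INPUT_KEY` above refers to log-mel spectrograms. For intuition only, a log-mel spectrogram for a single clip can be computed with `librosa` roughly as follows; the parameters actually used by `scripts/dataset.sh` may differ:

```python
# Illustrative sketch only: computing a log-mel spectrogram with librosa.
# The preprocessing script may use different parameters and additional features.
import librosa
import numpy as np

y, sr = librosa.load("/path/to/audio1.wav", sr=None)  # placeholder path
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)
```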
## Training

After the dataset is preprocessed, you can train the model with the following commands:
```bash
# For the DMER task
python train.py --device "cuda:0" --not_using_maml
# For the PDMER task
python train.py --device "cuda:0" --using_personalized_data_train --using_personalized_data_validate
```
## Inference

After training, you can run inference with the following code snippet:
```python
import torch

from utils.inference import build_batch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and checkpoint before inference, e.g.:
# model = PDMERModel(device=device).to(device)
# model.load_state_dict(torch.load("path/to/checkpoint.pth"))

audio_file_path_list = [
    "/path/to/audio1.wav",
    "/path/to/audio2.wav",
]

# Build the input batch.
embedding, _ = build_batch(
    audio_file_path_list,
    imagebind_model=None,  # if no ImageBind instance is passed, the model is loaded automatically
    device=device,
)

print("\nBuild batch embedding:")
for key, value in embedding.items():
    print("\t", key, value.shape)

print("Result:")
output = model(embedding)
print("Arousal: ", output["model_output"][0].shape)  # The first element is the arousal prediction, [batch_size, 2 * second]
print("Valence: ", output["model_output"][1].shape)  # The second element is the valence prediction, [batch_size, 2 * second]
```
## Citation

If you find this code useful in your research, please consider citing:
```bibtex
@misc{zhang2024personalizeddynamicmusicemotion,
  title={Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning},
  author={Dengming Zhang and Weitao You and Ziheng Liu and Lingyun Sun and Pei Chen},
  year={2024},
  eprint={2412.19200},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2412.19200},
}
```