
License: GPL v3

Multi-Modal Emotion Classification

Index

  1. Preface
  2. Project Overview
  3. Project Structure
  4. Datasets
  5. Dependencies
  6. Development environment
  7. Setting up the Environment
  8. How to Run the Project
  9. Test by Yourself
  10. Results
  11. Contributors
  12. References

Preface

This is an experimental project developed for the Cognitive Robotics exam at the University of Naples Parthenope. Its goal is to create a system in which a humanoid robot reacts to the emotions it perceives in a human. Emotions are perceived through an EEG headset and the audio and video sensors installed on the Pepper robot. This repository covers the first part of the project: building a model for emotion recognition. The second part, which is not on GitHub, defines behaviors on the Pepper robot based on the emotions detected by the model. The project is also described in a paper available at this link.

Project Overview

This project focuses on developing a multi-modal emotion classification system that combines audio, video, and EEG inputs. Two deep learning models and a meta-model are integrated to achieve this:

(Figure: overview of the system architecture)

  1. Audio-Video Emotion Classification Model: based on the paper "Learning Audio-Visual Emotional Representations with Hybrid Fusion Strategies", this model classifies four emotions using audio and video inputs.

  2. FBCCNN (Feature-Based Convolutional Neural Network): based on the paper "Emotion Recognition Based on EEG Using Generative Adversarial Nets and Convolutional Neural Network", this model uses EEG data to enhance emotion classification.

  3. Meta-model: this model receives as input the predictions of the two deep learning models and, through logistic regression, produces the final prediction among four classes: neutral, happy, angry, and sad (a minimal sketch of this stacking step is shown below).
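
The following is a minimal sketch of such a stacking step, assuming each base model outputs a per-class probability vector. The variable names, file names, and the use of scikit-learn's LogisticRegression are illustrative assumptions, not the repository's actual implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-class probabilities produced by the two base models
# (shape [n_samples, 4] each, for neutral/happy/angry/sad) and the true labels.
audio_video_probs = np.load("audio_video_probs.npy")  # assumed file name
eeg_probs = np.load("eeg_probs.npy")                  # assumed file name
labels = np.load("labels.npy")                        # assumed file name

# Stack the two prediction vectors into one feature vector per sample.
features = np.concatenate([audio_video_probs, eeg_probs], axis=1)

# Logistic-regression meta-model over the concatenated predictions.
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(features, labels)

final_predictions = meta_model.predict(features)  # assumed mapping: 0=neutral, 1=happy, 2=angry, 3=sad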

Project Structure

project-root
│
├───audio_video_emotion_recognition_model
│   │   
│   ├───datasets       
│   ├───Data_preprocessing        
│   ├───Image      
│   ├───Multimodal_transformer 
│   │   ├───Preprocessing_CNN  
│   │   │   ├───Preprocessing_utils
│   │   │   
│   │   ├───Transformers
│   │           
│   ├───results      
│   ├───utils
│                 
├───EEG_model  
│   ├───datasets   
│   ├───Images     
│   ├───results       
│   ├───utils 
│     
├───envs    
├───Meta_model        
│   ├───results      
└───Shared

Datasets

The dataset used for training the audio-video emotion recognition model is RAVDESS, which can be downloaded here.

The dataset used for training the EEG model is SEED-IV, which can be requested here.

Dependencies

The main dependencies are:

  • Python 3.9
  • PyTorch 2.6
  • Torcheeg 1.1.3

All dependencies are specified in the .yml files located in the envs directory.

Development environment

The development was performed in a Linux CentOS environment on a machine made available by the University of Naples Parthenope. The machine has 8 computational nodes, each equipped with 32 cores and 192 GB of RAM, for a total of 256 CPU cores. 4 of these 8 nodes are each equipped with 4 GPUs, for a total of 16 NVIDIA V100 NVLINK devices. Each GPU has 5120 CUDA cores and 32 GB of RAM, for a total of 81920 GPU cores. The computational nodes are connected to each other through a high-performance network.

Setting up the Environment

At the moment, the .yml file for the Windows environment is not complete: some libraries required by the audio-video and EEG models are missing, but they can easily be installed through conda. The Linux environment file is complete.

To replicate the development environment, you can use Conda. The .yml files required for creating the environment are located in the envs directory.

To create the environment on a Windows system, run:

conda env create -f envs/environment_windows.yml
conda activate cognitive_robotics_env

To create the environment on a Linux system, run:

conda env create -f envs/environment_linux.yml
conda activate cognitive_robotics_env
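
After activating the environment, you can quickly check from Python that the main dependencies resolved to the expected versions listed above:

from importlib.metadata import version

print("torch:", version("torch"))         # expected 2.6.x
print("torcheeg:", version("torcheeg"))   # expected 1.1.3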

How to Run the Project

The two base models must be trained individually before the meta-model can be trained.

  1. Audio-video emotion recognition model:

    cd audio_video_emotion_recognition_model

    Before using the model, you must perform the preprocessing steps:

    Inside each of the three scripts, specify the (full!) path where you downloaded the data. Then run:

    cd ravdess_preprocessing
    python extract_faces.py
    python extract_audios.py
    python create_annotations.py

    As a result, you will have an annotations.txt file that you can then use for training.

    • Training - Validation - Testing:
    python main.py

    If you want to perform only some of these steps, add the arguments --no-train, --no-val, or --test. For more details, see the opts file.

    • Prediction (for those who want to try this single model):
    python main.py --no-train --no-val --test --predict
  2. EEG-model:

    • Training - Validation - Testing:
    python main.py --path_eeg [Path of dataset SEED IV]

    If you have a folder with the cached, preprocessed SEED-IV dataset, you can specify it with the --path_cached argument.

    If you want to perform only some of these steps, add the arguments --no-train, --no-val, or --test. For more details, see the opts file.

  3. Meta model:

    • Training - Testing:
    python main.py --path_eeg [Path of dataset SEED IV]

    If you have a folder with the cached, preprocessed SEED-IV dataset, you can specify it with the --path_cached argument.

    If you want to run only the prediction, add the --predict argument.

    If you want to perform only some of these steps, add the arguments --no-train or --test. For more details, see the opts file.

Test by Yourself

If you want to test the models yourself, you can find the pretrained weights in the results directories of the respective models.
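
As an illustration, loading a pretrained checkpoint directly in PyTorch typically looks like the sketch below. The checkpoint file name, its internal keys, and the placeholder model class are assumptions and must be adapted to the actual files and model classes in this repository.

import torch
import torch.nn as nn

# Placeholder network: replace with the actual model class defined in this
# repository (e.g. the audio-video transformer or the FBCCNN).
class PlaceholderModel(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        return self.classifier(x)

model = PlaceholderModel()

# Hypothetical checkpoint path: adapt to the actual file in the results directory.
checkpoint = torch.load("results/best_model.pth", map_location="cpu")
state_dict = checkpoint.get("state_dict", checkpoint)  # handle both common saving conventions
model.load_state_dict(state_dict)
model.eval()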

Results

The following metrics are plotted:

  • For training: accuracy and loss.
  • For validation: accuracy and loss.
  • For testing: accuracy, loss, and confusion matrix.

Detailed plots for the audio-video model can be found in the audio_video_emotion_recognition_model/Image directory, while plots for the EEG model are available in the EEG_model/Images directory.

For the meta-model, the confusion matrix computed on the audio-video and EEG test sets is available in the Meta_model/Images directory.
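
As a generic illustration of how such a confusion matrix can be produced (this is not the repository's plotting code), assuming integer arrays of true and predicted labels for the four classes:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

class_names = ["neutral", "happy", "angry", "sad"]

# y_true and y_pred stand in for the real test-set labels and predictions.
y_true = [0, 1, 2, 3, 1, 0]
y_pred = [0, 1, 2, 2, 1, 0]

ConfusionMatrixDisplay.from_predictions(y_true, y_pred, display_labels=class_names)
plt.savefig("confusion_matrix.png")  # assumed output file name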

Contributors

References

  1. Esposito, R., Mele, V., Verrilli, S., Minopoli, S., D'Errico, L., De Santis, L., & Staffa, M. (2025). Cascade Multi-Modal Emotion Recognition Leveraging Audio-Video and EEG Signals. In: Proceedings of EMPATH-IA 2025: EMpowering PAtients THrough AI, Multimedia, and Explainable HCI 2025. CEUR Workshop Proceedings, vol. 4040, CEUR-WS.org. ISSN 1613-0073.
  2. Learning Audio-Visual Emotional Representations with Hybrid Fusion Strategies
  3. Emotion Recognition Based on EEG Using Generative Adversarial Nets and Convolutional Neural Network
  4. Github of Learning Audio-Visual Emotional Representations with Hybrid Fusion Strategies