markitantov/EmoSen


Multi-Lingual Approach for Multimodal Emotion and Sentiment Recognition Based on Triple Fusion

Abstract

Affective state recognition is a challenging task that requires a large amount of input data, such as audio, video, and text. This paper addresses affective state recognition as a multi-task problem involving both emotion and sentiment recognition. We consider several unimodal models based on temporal encoders: Transformer-based, Mamba, and xLSTM. We propose several multimodal fusion strategies, including double and triple fusion, with and without a label encoder. Double fusion strategies model the interaction between two main modalities, while triple fusion strategies handle the audio, video, and text modalities equally. Strategies with the label encoder combine emotion and sentiment predictions with deep features. Using three publicly available corpora, RAMAS, MELD, and CMU-MOSEI, we conduct an extensive experimental study with unimodal (audio, video, or text) and multimodal models to comprehensively understand their capabilities and limitations. On the Test subset of the CMU-MOSEI corpus, the proposed approach achieves a mean weighted F1-score (mWF) of 88.6% for emotion recognition and a weighted F1-score (WF) of 84.8% for sentiment recognition. On the Test subset of the MELD corpus, it achieves WF of 49.6% and 60.0%, respectively; on the Test subset of the RAMAS corpus, WF of 71.8% and 90.0%, respectively. We compare the performance of the proposed approach with that of state-of-the-art (SOTA) approaches.
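
To illustrate the triple fusion idea described above, here is a minimal PyTorch sketch in which the three modalities are projected into a shared space and fused symmetrically, with separate heads for the two tasks. All names (TripleFusion, d_model, the head sizes) are hypothetical assumptions for illustration, not this repository's actual API.

```python
# Minimal sketch of triple fusion for multi-task emotion/sentiment
# recognition, assuming per-modality feature vectors have already been
# produced by temporal encoders (e.g. Transformer, Mamba, or xLSTM).
# All names here are hypothetical, not the repository's actual code.
import torch
import torch.nn as nn


class TripleFusion(nn.Module):
    def __init__(self, d_audio, d_video, d_text,
                 d_model=256, n_emotions=7, n_sentiments=3):
        super().__init__()
        # Project each modality into a shared space so that audio,
        # video, and text are handled equally.
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_video, d_model)
        self.proj_t = nn.Linear(d_text, d_model)
        # Self-attention over the three modality tokens lets every
        # modality attend to the other two symmetrically.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4,
                                          batch_first=True)
        # Separate heads for the two tasks (multi-task setup).
        self.emotion_head = nn.Linear(d_model, n_emotions)
        self.sentiment_head = nn.Linear(d_model, n_sentiments)

    def forward(self, a, v, t):
        # a, v, t: (batch, d_*) pooled feature vectors, one per clip.
        tokens = torch.stack(
            [self.proj_a(a), self.proj_v(v), self.proj_t(t)], dim=1
        )  # (batch, 3, d_model)
        fused, _ = self.attn(tokens, tokens, tokens)
        pooled = fused.mean(dim=1)  # average the three modality tokens
        return self.emotion_head(pooled), self.sentiment_head(pooled)


# Usage with dummy features:
model = TripleFusion(d_audio=128, d_video=512, d_text=768)
a, v, t = torch.randn(4, 128), torch.randn(4, 512), torch.randn(4, 768)
emo_logits, sent_logits = model(a, v, t)
```

A double fusion strategy would differ by attending between only two main modalities, and a label-encoder variant would additionally feed the emotion and sentiment predictions back in alongside the deep features.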

