Affective state recognition is a challenging task that requires a large amount of input data, such as audio, video, and text. This paper addresses the problem of affective state recognition framed as multi-task emotion and sentiment recognition. We consider several unimodal models based on temporal encoders: Transformer-based, Mamba, and xLSTM. We propose various multimodal fusion strategies, including double and triple fusion strategies with and without a label encoder. Double fusion strategies model the interaction between two main modalities, while triple fusion strategies handle the audio, video, and text modalities equally. Strategies with the label encoder combine emotion and sentiment predictions with deep features. Using three publicly available corpora, RAMAS, MELD, and CMU-MOSEI, we conduct an extensive experimental study of unimodal (audio, video, or text) and multimodal models to comprehensively understand their capabilities and limitations. On the Test subset of the CMU-MOSEI corpus, the proposed approach achieves a mean weighted F1-score (mWF) of 88.6% and a weighted F1-score (WF) of 84.8% for emotion and sentiment recognition, respectively. On the Test subset of the MELD corpus, the proposed approach achieves WFs of 49.6% and 60.0%, respectively. On the Test subset of the RAMAS corpus, it achieves WFs of 71.8% and 90.0%, respectively. We compare the performance of the proposed approach with that of state-of-the-art (SOTA) approaches.
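
As a rough illustration of the triple fusion idea described above, the sketch below shows how audio, video, and text features could be encoded per modality, fused, and then passed through a label encoder that combines emotion and sentiment predictions with deep features. This is a minimal, hypothetical sketch: the module names, feature dimensions, mean pooling, and concatenation-based fusion are assumptions for illustration only and are not taken from the EmoSen repository or the paper's actual architecture.

```python
# Minimal PyTorch sketch of a triple fusion model with a label encoder.
# All names, dimensions, and layer choices are illustrative assumptions.
import torch
import torch.nn as nn


class TripleFusionSketch(nn.Module):
    def __init__(self, d_audio=128, d_video=256, d_text=768, d_model=256,
                 n_emotions=7, n_sentiments=3):
        super().__init__()
        # Per-modality projections standing in for the temporal encoders
        # (Transformer-based, Mamba, or xLSTM in the paper).
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.video_proj = nn.Linear(d_video, d_model)
        self.text_proj = nn.Linear(d_text, d_model)
        # Triple fusion: all three modalities are treated equally.
        self.fusion = nn.Sequential(nn.Linear(3 * d_model, d_model), nn.ReLU())
        # Multi-task heads for emotions and sentiment.
        self.emotion_head = nn.Linear(d_model, n_emotions)
        self.sentiment_head = nn.Linear(d_model, n_sentiments)
        # Label encoder: maps the two sets of logits back into feature space
        # so the predictions can be combined with deep features.
        self.label_encoder = nn.Linear(n_emotions + n_sentiments, d_model)
        self.emotion_head_refined = nn.Linear(2 * d_model, n_emotions)
        self.sentiment_head_refined = nn.Linear(2 * d_model, n_sentiments)

    def forward(self, audio, video, text):
        # Mean-pool each (batch, time, dim) input over the time dimension.
        a = self.audio_proj(audio.mean(dim=1))
        v = self.video_proj(video.mean(dim=1))
        t = self.text_proj(text.mean(dim=1))
        fused = self.fusion(torch.cat([a, v, t], dim=-1))
        emo_logits = self.emotion_head(fused)
        sen_logits = self.sentiment_head(fused)
        # Combine label embeddings with deep features for the final predictions.
        label_emb = self.label_encoder(torch.cat([emo_logits, sen_logits], dim=-1))
        joint = torch.cat([fused, label_emb], dim=-1)
        return self.emotion_head_refined(joint), self.sentiment_head_refined(joint)


if __name__ == "__main__":
    model = TripleFusionSketch()
    audio = torch.randn(2, 50, 128)   # (batch, frames, audio feature dim)
    video = torch.randn(2, 50, 256)   # (batch, frames, video feature dim)
    text = torch.randn(2, 30, 768)    # (batch, tokens, text embedding dim)
    emo, sen = model(audio, video, text)
    print(emo.shape, sen.shape)       # torch.Size([2, 7]) torch.Size([2, 3])
```

A double fusion strategy would follow the same pattern but fuse only two main modalities, and a variant without the label encoder would return the first-stage logits directly.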