Multimodal Audio-Visual Emotion Recognition for Social Robotics

Giuseppe De Simone; Luca Greco; Alessia Saggese; Mario Vento
2025

Abstract

Thanks to recent advances in deep-learning-based algorithms, humanoid social robots are increasingly exhibiting human-like behaviors. In this context, the analysis of soft biometrics, particularly emotion recognition, is crucial for enhancing communication between social robots and humans, facilitating emotion-aware dialogues. In light of these considerations, we propose a multimodal emotion recognition system tailored for social robotics applications. The system processes both video and audio data to classify the expressed emotion into one of six classes, employing 3D convolutional operations that eliminate the need for transformer-based architectures, effectively reducing the model's size and enabling the network to run on low-power embedded devices mounted directly on board the robot. The proposed approach was trained on the CREMA-D dataset and demonstrates strong performance compared to its video-only and audio-only counterparts, outperforming both unimodal and multimodal state-of-the-art methods.
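The record contains no code, but the abstract names two building blocks: a 3D convolution over a stack of video frames (replacing transformer layers) and a fusion of the audio and video branches into six emotion classes. The following NumPy sketch is only an illustration of those operations; the kernel sizes, input shapes, and the averaged-softmax late-fusion rule are assumptions for the example, not the authors' implementation.

```python
import numpy as np

def conv3d(x, w):
    """Naive 'valid' 3D convolution over a single-channel volume.
    x: (T, H, W) frame stack, w: (t, h, w) kernel."""
    T, H, W = x.shape
    t, h, wd = w.shape
    out = np.zeros((T - t + 1, H - h + 1, W - wd + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(x[i:i + t, j:j + h, k:k + wd] * w)
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
clip = rng.standard_normal((16, 32, 32))    # 16 grayscale frames (assumed shape)
kernel = rng.standard_normal((3, 3, 3))     # one 3x3x3 spatio-temporal filter
feat = conv3d(clip, kernel)                 # (14, 30, 30) feature volume

# Hypothetical per-branch class scores; in the real system these would come
# from the video and audio sub-networks, not random draws.
NUM_CLASSES = 6
video_logits = rng.standard_normal(NUM_CLASSES)
audio_logits = rng.standard_normal(NUM_CLASSES)

# Late fusion by averaging the two branches' class distributions (assumed rule).
fused = (softmax(video_logits) + softmax(audio_logits)) / 2
prediction = int(np.argmax(fused))
print(feat.shape, prediction)
```

A kernel spanning three frames, as above, is the smallest unit that captures motion as well as appearance; stacking such layers is what lets a purely convolutional model encode temporal context without attention.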

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4918315