Conference papers

Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild

Abstract: In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNNs), while the third consists of a pretrained audio network used to extract useful deep acoustic signals from video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve an accuracy of 50.39% and 49.92% on the validation and testing data, respectively.
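The fusion step described above — three separately trained sub-networks whose outputs are merged into one joint representation and classified — can be sketched as follows. This is a minimal illustrative sketch only, assuming feature dimensions (512, 512, 256), a simple concatenation-based fusion, and the seven EmotiW emotion classes; the paper's actual sub-networks, fusion layers, and dimensions are not specified here.

```python
import numpy as np

# Hypothetical late-fusion sketch: each branch (2D CNN for static facial
# features, 3D CNN for motion patterns, LSTM over deep acoustic features)
# is stood in for by a random feature vector of an assumed size.
rng = np.random.default_rng(0)

static_feat = rng.standard_normal(512)   # stand-in for the 2D CNN output
motion_feat = rng.standard_normal(512)   # stand-in for the 3D CNN output
audio_feat  = rng.standard_normal(256)   # stand-in for the audio LSTM output

# Fusion network (simplified): concatenate the per-modality cues into one
# joint representation, then classify with a single linear layer + softmax.
fused = np.concatenate([static_feat, motion_feat, audio_feat])

n_classes = 7                            # EmotiW emotion categories
W = rng.standard_normal((n_classes, fused.size)) * 0.01
b = np.zeros(n_classes)

logits = W @ fused + b
probs = np.exp(logits - logits.max())    # numerically stable softmax
probs /= probs.sum()

prediction = int(np.argmax(probs))       # predicted emotion class index
```

In practice the fused vector would feed one or more trainable fully connected layers rather than a fixed random projection; the sketch only illustrates how the three modality representations are combined into a single prediction.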
Contributor: Olfa Ben Ahmed
Submitted on : Wednesday, March 13, 2019 - 9:48:03 AM
Last modification on : Friday, August 9, 2019 - 11:42:05 AM





Stefano Pini, Olfa Ben Ahmed, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, et al. Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild. 19th ACM International Conference on Multimodal Interaction (ICMI 2017), Nov 2017, Glasgow, United Kingdom. pp. 536-543, ⟨10.1145/3136755.3143006⟩. ⟨hal-02065973⟩


