Toward an audiovisual attention model for multimodal video content

Abstract: Visual attention modeling is a very active research field, and several image and video attention models have been proposed over the last decade. However, despite the conclusions of various studies on the influence of sound on human gaze, most classical video attention models do not account for the multimodal nature of video (visual and auditory cues). In this paper, we propose an audiovisual saliency model that aims to predict human gaze maps during the exploration of video content. The model, intended for videoconferencing, is based on the fusion of spatial, temporal, and auditory attentional maps. Building on a real-time audiovisual speaker localization approach, the proposed auditory map is modulated depending on the role of each face in the video, i.e., speaker or listener. State-of-the-art performance measures have been used to compare the predicted saliency maps with the eye-tracking ground truth. The results show the very good performance of the proposed model and a significant improvement over non-audio models.
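The abstract describes the model as a fusion of spatial, temporal, and auditory attentional maps. The exact fusion scheme is not specified here; the sketch below assumes a simple weighted linear combination of min-max-normalized maps, with the function name and default weights chosen purely for illustration.

```python
import numpy as np

def fuse_saliency_maps(spatial, temporal, auditory, weights=(1.0, 1.0, 1.0)):
    """Fuse per-frame attentional maps into a single saliency map.

    Each input is a 2-D array over the frame. Maps are min-max
    normalized to [0, 1], then combined linearly with the given
    weights (illustrative defaults; the paper's scheme may differ).
    """
    def normalize(m):
        m = m.astype(float)
        rng = m.max() - m.min()
        return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

    maps = [normalize(m) for m in (spatial, temporal, auditory)]
    fused = sum(w * m for w, m in zip(weights, maps))
    return normalize(fused)  # renormalize so the output is in [0, 1]
```

In practice, such predicted maps are compared against eye-tracking fixation maps with measures such as the Pearson correlation coefficient or AUC, which matches the abstract's mention of state-of-the-art performance measures.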
Document type:
Journal article
Neurocomputing, Elsevier, 2017, 259, pp. 94-111. 〈10.1016/j.neucom.2016.08.130〉
Contributor: Mohamed-Chaker Larabi
Submitted on: Wednesday, August 24, 2016 - 15:32:26
Last modified on: Tuesday, August 29, 2017 - 16:17:42

Naty Sidaty, Mohamed-Chaker Larabi, Abdelhakim Saadane. Toward an audiovisual attention model for multimodal video content. Neurocomputing, Elsevier, 2017, 259, pp. 94-111. 〈10.1016/j.neucom.2016.08.130〉. 〈hal-01355968〉


