Toward an audiovisual attention model for multimodal video content
Abstract
Visual attention modeling is a very active research field, and several image and video attention models have been proposed during the last decade. However, although various studies have shown that the presence of sound influences human gaze, most classical video attention models do not account for the multimodal nature of video (visual and auditory cues). In this paper, we propose an audiovisual saliency model that aims to predict human gaze maps during the exploration of video content. The model, intended for videoconferencing, is based on the fusion of spatial, temporal and auditory attentional maps. The proposed auditory map relies on a real-time audiovisual speaker localization approach and is modulated according to the role of each face in the video, i.e. speaker or listener. State-of-the-art performance measures are used to compare the predicted saliency maps with the eye-tracking ground truth. The results show that the proposed model performs very well and improves significantly on models that ignore audio.
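To make the fusion idea concrete, the following is a minimal sketch of how spatial, temporal and auditory maps might be combined, with the auditory map weighted by the role of each face. The abstract does not specify the fusion rule, so the linear weighted sum, the function names (`normalize`, `auditory_map`, `fuse`), the face-box format and the gain and weight values are all illustrative assumptions, not the method of the paper.

```python
import numpy as np

def normalize(m):
    """Rescale a map to [0, 1]; a constant map becomes all zeros."""
    m = np.asarray(m, dtype=float)
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)

def auditory_map(shape, faces, speaker_gain=1.0, listener_gain=0.3):
    """Build an auditory map from face boxes.

    Each face is (top, left, bottom, right, is_speaker). The gains are
    hypothetical values: speaking faces get more weight than listening ones.
    """
    amap = np.zeros(shape, dtype=float)
    for top, left, bottom, right, is_speaker in faces:
        gain = speaker_gain if is_speaker else listener_gain
        region = amap[top:bottom, left:right]   # view into amap
        np.maximum(region, gain, out=region)    # keep the larger gain on overlap
    return amap

def fuse(spatial, temporal, auditory, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed fusion rule: weighted sum of normalized maps, renormalized."""
    maps = [normalize(m) for m in (spatial, temporal, auditory)]
    fused = sum(w * m for w, m in zip(weights, maps))
    return normalize(fused)
```

With face boxes supplied by a speaker localization stage, `fuse(spatial, temporal, auditory_map(frame_shape, faces))` would yield one saliency map per frame that can be compared against eye-tracking data. A weighted sum is only one option; other combinations, such as products or maxima of the maps, would fit the same interface.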