Conference papers

A conditional random field approach for audio-visual people diarization

Abstract: We investigate the problem of audiovisual (AV) person diarization in broadcast data, that is, automatically associating the faces and voices of people and determining when they appear or speak in the video. The contributions are twofold. First, we formulate the problem within a novel conditional random field (CRF) framework that simultaneously performs the AV association of voices and face clusters to build AV person models, and the joint segmentation of the audio and visual streams using a set of AV cues and their association strength. Second, for this AV association strength we use a score that relies not only on lip activity but also on contextual visual information (face size, position, number of detected faces, ...), which leads to more reliable association measures. Experiments on 6 hours of broadcast data show that our framework improves AV person diarization, especially for speaker segments erroneously labeled in the mono-modal case.
Document type: Conference papers

Cited literature: 21 references
Contributor: Sylvain Meignier
Submitted on: Saturday, April 1, 2017 - 12:44:41 AM
Last modification on: Tuesday, December 8, 2020 - 9:44:14 AM
Long-term archiving on: Sunday, July 2, 2017 - 12:20:16 PM


Paul Gay, Elie Khoury, Sylvain Meignier, Jean-Marc Odobez, Paul Deléglise. A conditional random field approach for audio-visual people diarization. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014), 2014, Florence, Italy. pp. 116-120, ⟨10.1109/ICASSP.2014.6853569⟩. ⟨hal-01433223⟩


