Naming multi-modal clusters to identify persons in TV broadcast

Person identification in TV broadcasts is one of the main tools for indexing this type of video. The classical approach uses biometric face and speaker models, but covering a decent number of persons requires costly annotations. In recent years, several works have proposed using other sources of names to identify people, such as pronounced names and written names. The main idea is to form face/speaker clusters based on their similarities and to propagate these names onto the clusters. In this paper, we propose a method that takes advantage of written names during the diarization process, both to name clusters and to prevent the fusion of two clusters that carry different names. First, we extract written names with the LOOV tool; these names are associated with their co-occurring speaker turns / face tracks. Simultaneously, we build a multi-modal matrix of distances between speaker turns and face tracks. Agglomerative clustering is then performed on this matrix, with the constraint that clusters associated with different names are never merged. We also integrate the predictions of a few biometric models (anchors, some journalists) to directly identify speaker turns / face tracks before the clustering process. Our approach was evaluated on the REPERE corpus and reached an F-measure of 68.2% for speaker identification and 60.2% for face identification. Adding a few biometric models improves these results to 82.4% and 65.6% for speaker and face identification respectively. By comparison, a mono-modal, supervised person identification system with 706 speaker models, trained on matching development data and additional TV and radio data, reaches an F-measure of 67.8%, while 908 face models reach only 30.5%.

Keywords

TV broadcast, Face and speaker identification, Multimodal fusion, VideoOCR

Domains

Information retrieval [cs.IR]

Contributor: Georges Quénot

https://hal.science/hal-01230628

Submitted on: Wednesday, 18 November 2015, 17:16:25

Last modified on: Monday, 15 April 2024, 11:25:23

Dates and versions

hal-01230628, version 1 (18-11-2015)

Identifiers

HAL Id: hal-01230628, version 1
DOI: 10.1007/s11042-015-2723-1

Cite

Johann Poignant, Guillaume Fortier, Laurent Besacier, Georges Quénot. Naming multi-modal clusters to identify persons in TV broadcast. Multimedia Tools and Applications, 2016, 75 (15), pp.8999-9023. ⟨10.1007/s11042-015-2723-1⟩. ⟨hal-01230628⟩


Collections

UGA CNRS LIG LIG_TDCGE_GETALP LIG_TDCGE_MRIM POLYTECH-GRENOBLE ANR LIG_SIDCH
