Journal article. Multimedia Tools and Applications. Year: 2016

Naming multi-modal clusters to identify persons in TV broadcast

Abstract

Person identification in TV broadcast is one of the main tools for indexing this type of video. The classical approach is to use biometric face and speaker models, but covering a decent number of persons requires costly annotations. In recent years, several works have proposed using other sources of names to identify people, such as pronounced names and written names. The main idea is to form face/speaker clusters based on their similarities and to propagate these names onto the clusters. In this paper, we propose a method that takes advantage of written names during the diarization process, in order both to name clusters and to prevent the fusion of two clusters named differently. First, we extract written names with the LOOV tool; these names are associated with their co-occurring speaker turns / face tracks. Simultaneously, we build a multi-modal matrix of distances between speaker turns and face tracks. Agglomerative clustering is then performed on this matrix, with the constraint that clusters associated with different names must not be merged. We also integrate the predictions of a few biometric models (anchors, some journalists) to directly identify speaker turns / face tracks before the clustering process. Our approach was evaluated on the REPERE corpus and reached an F-measure of 68.2% for speaker identification and 60.2% for face identification. Adding a few biometric models improves the results, leading to 82.4% and 65.6% for speaker and face identification respectively. By comparison, a mono-modal, supervised person identification system with 706 speaker models trained on matching development data and additional TV and radio data provides a 67.8% F-measure, while 908 face models provide only a 30.5% F-measure.
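The name-constrained clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `constrained_clustering`, the use of average linkage, the stopping `threshold`, and the toy distances are all assumptions made for illustration, and the sketch treats speaker turns and face tracks as items of a single distance matrix. It only shows the core idea that two clusters carrying different written names are never merged, while an unnamed cluster inherits the name of the cluster it joins.

```python
from itertools import combinations


def constrained_clustering(dist, names, threshold):
    """Agglomerative clustering that refuses to merge differently named clusters.

    dist: symmetric list-of-lists distance matrix between items
          (items stand for speaker turns or face tracks)
    names: one entry per item, a written name or None if no name co-occurs
    threshold: hypothetical stopping criterion on the linkage distance
    """
    clusters = [{"items": [i], "name": names[i]} for i in range(len(names))]

    def linkage(a, b):
        # Average linkage (an assumption; the abstract does not specify one).
        pairs = [(i, j) for i in a["items"] for j in b["items"]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            ca, cb = clusters[a], clusters[b]
            # Constraint: skip pairs associated with two different names.
            if ca["name"] and cb["name"] and ca["name"] != cb["name"]:
                continue
            d = linkage(ca, cb)
            if best is None or d < best[0]:
                best = (d, a, b)
        if best is None or best[0] > threshold:
            break
        _, a, b = best
        merged = {
            "items": clusters[a]["items"] + clusters[b]["items"],
            # Name propagation: the merged cluster keeps whichever name exists.
            "name": clusters[a]["name"] or clusters[b]["name"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters


# Toy data: items 0 and 2 have co-occurring written names, items 1 and 3 do not.
names = ["Alice", None, "Bob", None]
dist = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.3, 0.8],
    [0.2, 0.3, 0.0, 0.15],
    [0.9, 0.8, 0.15, 0.0],
]
result = constrained_clustering(dist, names, threshold=0.5)
for cluster in result:
    print(cluster["name"], sorted(cluster["items"]))
```

In this toy example, items 0 and 2 are close (distance 0.2) but carry different names, so they stay in separate clusters, and each unnamed item joins its nearest named neighbour and takes that name.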
File not deposited

Dates and versions

hal-01230628, version 1 (18-11-2015)

Cite

Johann Poignant, Guillaume Fortier, Laurent Besacier, Georges Quénot. Naming multi-modal clusters to identify persons in TV broadcast. Multimedia Tools and Applications, 2016, 75 (15), pp.8999-9023. ⟨10.1007/s11042-015-2723-1⟩. ⟨hal-01230628⟩