Unsupervised Speaker Identification in TV Broadcast Based on Written Names

Johann Poignant; Laurent Besacier; Georges Quénot

doi:10.1109/TASLP.2014.2367822

Journal article, IEEE Transactions on Audio, Speech and Language Processing, Year: 2015

Johann Poignant

Role: Author
PersonId: 934025

Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Laurent Besacier

Role: Author
PersonId: 1521
IdHAL: laurent-besacier
ORCID: 0000-0001-7411-9125
IdRef: 079377017

Laboratoire d'Informatique de Grenoble

Georges Quénot

Role: Author
PersonId: 3114
IdHAL: georges-quenot
ORCID: 0000-0003-2117-247X
IdRef: 034104518

Modélisation et Recherche d’Information Multimédia [Grenoble]

Abstract

Identifying speakers in TV broadcasts in an unsupervised way (i.e., without biometric models) avoids costly annotation. Existing methods usually rely on pronounced names to identify the speech clusters produced by a diarization step, but this source is too imprecise to give sufficient confidence. To overcome this issue, another source of names can be used: the names written in title blocks in the image track. We first compare these two sources of names on their ability to provide the names of the speakers in TV broadcasts. This study shows that written names are the better choice, owing to their high precision in identifying the current speaker. We then propose two approaches for finding speaker identity based only on names written in the image track. With the "late naming" approach, we propose different ways of propagating written names onto clusters. Our second proposal, "early naming", modifies the speaker diarization module (agglomerative clustering) by adding constraints that prevent two clusters associated with different written names from being merged. These methods were tested on phase 1 of the REPERE corpus, which contains 3 hours of annotated videos. Our best "late naming" system reaches an F-measure of 73.1%. "Early naming" improves on this result in terms of both identification error rate and stability of the clustering stopping criterion. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models, trained on matching development data and additional TV and radio data, reached only a 57.2% F-measure.
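The "early naming" idea in the abstract, agglomerative clustering with a constraint that blocks merges between clusters carrying different written names, can be illustrated with a minimal sketch. This is not the authors' implementation: the centroid-based Euclidean distance, the fixed stopping threshold, the toy 2-D feature vectors, and all function and variable names (`can_merge`, `early_naming_clustering`, `demo_segments`) are assumptions made only for illustration.

```python
from itertools import combinations
from math import dist


def can_merge(names_a, names_b):
    """Allow a merge unless both clusters carry names and those names differ."""
    return not names_a or not names_b or names_a == names_b


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def early_naming_clustering(segments, stop_threshold):
    """Constrained agglomerative clustering (illustrative sketch).

    segments: list of (feature_vector, written_name_or_None) pairs.
    stop_threshold: hypothetical stopping criterion; merging stops once the
    closest allowed pair of clusters is farther apart than this value.
    """
    # Start with one cluster per segment, labelled with its written name if any.
    clusters = [
        {"points": [vec], "names": {name} if name else set()}
        for vec, name in segments
    ]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            a, b = clusters[i], clusters[j]
            if not can_merge(a["names"], b["names"]):
                continue  # constraint: never merge two differently named clusters
            d = dist(centroid(a["points"]), centroid(b["points"]))
            if best is None or d < best[0]:
                best = (d, i, j)
        if best is None or best[0] > stop_threshold:
            break
        _, i, j = best
        clusters[i] = {
            "points": clusters[i]["points"] + clusters[j]["points"],
            "names": clusters[i]["names"] | clusters[j]["names"],
        }
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Toy data: two named segments that are acoustically close, plus two unnamed ones.
demo_segments = [
    ([0.0, 0.0], "Alice"),
    ([0.1, 0.0], None),
    ([0.3, 0.0], "Bob"),
    ([5.0, 5.0], None),
]
result = early_naming_clustering(demo_segments, stop_threshold=1.0)
print([sorted(c["names"]) for c in result])
```

In this toy run the unnamed segment closest to "Alice" joins her cluster, while the "Bob" segment stays separate even though its distance to that cluster is below the threshold, because the name constraint forbids the merge. An unconstrained clustering with the same threshold would have merged them.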

Keywords

TV broadcast, multimodal fusion, speaker identification, speaker diarization, written names

Domains

Computation and Language [cs.CL], Document and Text Processing

Main file

POIGNANT--ASLP--2013-2.pdf (1.85 MB)

Origin: Files produced by the author(s)

Contributor: Laurent Besacier

https://hal.science/hal-01060827

Submitted on: Thursday, September 4, 2014, 12:07:23

Last modified on: Thursday, April 4, 2024, 21:30:38

Long-term archiving on: Friday, December 5, 2014, 10:25:54

Dates and versions

hal-01060827, version 1 (04-09-2014)

Identifiers

HAL Id: hal-01060827, version 1
DOI: 10.1109/TASLP.2014.2367822

Cite

Johann Poignant, Laurent Besacier, Georges Quénot. Unsupervised Speaker Identification in TV Broadcast Based on Written Names. IEEE Transactions on Audio, Speech and Language Processing, 2015, 23 (1), pp.57-68. ⟨10.1109/TASLP.2014.2367822⟩. ⟨hal-01060827⟩

Collections

UGA CNRS LIG LIG_TDCGE LIG_TDCGE_GETALP LIG_TDCGE_MRIM POLYTECH-GRENOBLE ANR LIG_SIDCH

217 views

617 downloads
