A comparative study of bottom-up and top-down approaches to speaker diarization

Abstract: This paper presents a theoretical framework for analyzing the relative merits of the two most general, dominant approaches to speaker diarization: bottom-up and top-down hierarchical clustering. We present an original qualitative comparison which argues that the two approaches are likely to exhibit different behavior in speaker inventory optimization and model training: bottom-up approaches will capture comparatively purer models and will thus be more sensitive to nuisance variation such as that related to the speech content; top-down approaches, in contrast, will produce less discriminative speaker models but, importantly, models which are potentially better normalized against nuisance variation. We report experiments conducted on two standard, single-channel NIST RT evaluation datasets which validate our hypotheses. Results show that competitive performance can be achieved with both bottom-up and top-down approaches (average DERs of 21% and 22%), and that neither approach is superior. Speaker purification, which aims to improve speaker discrimination, gives more consistent improvements with the top-down system than with the bottom-up system (average DERs of 19% and 25%), thereby confirming that the top-down system is less discriminative and that the bottom-up system is less stable. Finally, we report a new combination strategy that exploits the merits of the two approaches. Combination delivers an average DER of 17% and confirms the intrinsic complementarity of the two approaches.
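As a rough illustration of the two clustering directions contrasted in the abstract, the toy sketch below groups one-dimensional points either by merging (bottom-up) or by splitting (top-down). It is not the systems evaluated in the paper: the function names `bottom_up` and `top_down`, the centroid-distance merge rule, and the largest-gap split rule are all assumptions chosen for brevity, and real diarization systems operate on acoustic features with statistical speaker models.

```python
from statistics import mean


def bottom_up(points, n_clusters):
    """Toy agglomerative clustering (hypothetical helper).

    Start with one cluster per point and repeatedly merge the two
    clusters whose centroids are closest, until n_clusters remain.
    """
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(mean(clusters[i]) - mean(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


def top_down(points, n_clusters):
    """Toy divisive clustering (hypothetical helper).

    Start with a single cluster holding every point and repeatedly
    split the widest cluster at its largest internal gap, until
    n_clusters exist or no cluster can be split further.
    """
    clusters = [sorted(points)]
    while len(clusters) < n_clusters:
        # Pick the cluster with the widest value range.
        idx = max(range(len(clusters)),
                  key=lambda k: clusters[k][-1] - clusters[k][0])
        c = clusters[idx]
        if len(c) < 2:
            break  # only single-point clusters remain
        # Split between the two neighbours that are furthest apart.
        cut = max(range(1, len(c)), key=lambda k: c[k] - c[k - 1])
        clusters[idx:idx + 1] = [c[:cut], c[cut:]]
    return sorted(clusters)


# Made-up sample values forming three well-separated groups.
points = [1.0, 1.2, 0.9, 5.0, 5.3, 9.8, 10.1]
merged = bottom_up(points, 3)
split = top_down(points, 3)
```

On well-separated toy data such as this, both directions arrive at the same grouping; the differences discussed in the abstract concern harder, real-world conditions where the two strategies build their speaker models differently.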
Document type:
Journal article
IEEE transactions on acoustics, speech, and signal processing, Institute of Electrical and Electronics Engineers (IEEE), 2010, pp.1

Cited literature [41 references]

https://hal.archives-ouvertes.fr/hal-00733394
Contributor: Simon Bozonnet
Submitted on: Tuesday, 18 September 2012 - 15:32:48
Last modified on: Friday, 26 January 2018 - 10:47:09
Document(s) archived on: Wednesday, 19 December 2012 - 03:45:30

File

IEEE_Transaction2010_Comparati...
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00733394, version 1

Citation

Nicholas Evans, Simon Bozonnet, Dong Wang, Corinne Fredouille, Raphaël Troncy. A comparative study of bottom-up and top-down approaches to speaker diarization. IEEE transactions on acoustics, speech, and signal processing, Institute of Electrical and Electronics Engineers (IEEE), 2010, pp.1. 〈hal-00733394〉
