Multimodal embedding fusion for robust speaker role recognition in video broadcast

Person role recognition in video broadcasts consists of classifying people into roles such as anchor, journalist, or guest. Existing approaches mostly consider a single modality, either audio (speaker role recognition) or image (shot role recognition), first because the two modalities are not synchronized, and second because of the lack of a video corpus annotated in both modalities. Deep neural network (DNN) approaches can learn feature representations (embeddings) and classification functions simultaneously. This paper presents a multimodal fusion of audio, text, and image embedding spaces for speaker role recognition in asynchronous data. Monomodal embeddings are trained on exogenous data and fine-tuned using a DNN on a 70-hour corpus of French broadcasts for the target task. Experiments on the REPERE corpus show the benefit of embedding-level fusion over the monomodal embedding systems and over the standard late fusion method.
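To make the contrast between the two fusion strategies named in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' system: the embedding dimensions, the random linear classifiers, and the function names `late_fusion` and `embedding_fusion` are all hypothetical, and the untrained linear layers stand in for the DNN that the paper fine-tunes.

```python
import numpy as np

# Role names taken from the abstract; the full label set is an assumption.
ROLES = ["anchor", "journalist", "guest"]


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def late_fusion(posteriors, weights=None):
    """Late fusion: weighted average of per-modality class posteriors."""
    stacked = np.stack(posteriors)  # (n_modalities, n_samples, n_roles)
    if weights is None:
        weights = np.full(len(posteriors), 1.0 / len(posteriors))
    weights = np.asarray(weights, dtype=float)
    return np.tensordot(weights, stacked, axes=1)  # (n_samples, n_roles)


def embedding_fusion(embeddings):
    """Embedding-level fusion: concatenate per-modality embeddings."""
    return np.concatenate(embeddings, axis=-1)


rng = np.random.default_rng(0)
n_samples = 4

# Hypothetical embedding sizes; the real dimensions are not given here.
audio = rng.normal(size=(n_samples, 8))
text = rng.normal(size=(n_samples, 6))
image = rng.normal(size=(n_samples, 5))


def random_classifier(dim):
    """Untrained linear layer, a placeholder for a trained classifier."""
    return rng.normal(size=(dim, len(ROLES)))


# Late fusion: one classifier per modality, decisions combined afterwards.
posteriors = [
    softmax(x @ random_classifier(x.shape[1])) for x in (audio, text, image)
]
late = late_fusion(posteriors)

# Embedding-level fusion: one classifier sees the joint representation.
joint = embedding_fusion([audio, text, image])
early = softmax(joint @ random_classifier(joint.shape[1]))
```

In the late-fusion path each modality is classified on its own and only the decisions are merged, whereas in the embedding-level path a single classifier can exploit interactions between the audio, text, and image representations.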

Keywords

Multimodal Speaker Embeddings Broadcast News Speaker role recognition

Domains

Computer Science [cs] Artificial Intelligence [cs.AI] Machine Learning [cs.LG] Computation and Language [cs.CL]

Main file

favre_asru2015b.pdf (162.73 KB)

Origin: Files produced by the author(s)

Contributor: Sebastien Delecraz

https://hal.science/hal-01475413

Submitted on: Thursday, February 23, 2017, 16:45:43

Last modified on: Friday, March 22, 2024, 18:24:04

Long-term archiving on: Wednesday, May 24, 2017, 14:36:37

Dates and versions

hal-01475413, version 1 (23-02-2017)

Identifiers

HAL Id: hal-01475413, version 1
DOI : 10.1109/ASRU.2015.7404820

Cite

Mickael Rouvier, Sebastien Delecraz, Benoit Favre, Meriem Bendris, Frédéric Bechet. Multimodal embedding fusion for robust speaker role recognition in video broadcast. Automatic Speech Recognition and Understanding, Dec 2015, Scottsdale, United States. pp.383 - 389, ⟨10.1109/ASRU.2015.7404820⟩. ⟨hal-01475413⟩

Collections

UNIV-TLN LIF CNRS UNIV-AMU EC-MARSEILLE LIS-LAB
