How Diachronic Text Corpora Affect Context based Retrieval of OOV Proper Names for Audio News

Imran Sheikh 1, Irina Illina 1, Dominique Fohr 1
1 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract: Out-Of-Vocabulary (OOV) words missed by Large Vocabulary Continuous Speech Recognition (LVCSR) systems can be recovered with the help of the topic and semantic context of the OOV words, captured from a diachronic text corpus. In this paper we investigate how the choice of documents for the diachronic text corpus affects the retrieval of OOV Proper Names (PNs) relevant to an audio document. We first present our diachronic French broadcast news datasets, which highlight the motivation of our study on OOV PNs. Then the effect of using diachronic text data from different sources and a different time span is analysed. With OOV PN retrieval experiments on French broadcast news videos, we conclude that a diachronic corpus with text from different sources leads to better retrieval performance than one relying on text from a single source or from a longer time span.
Document type:
Conference paper
LREC 2016, May 2016, Portoroz, Slovenia. Proceedings of LREC 2016

https://hal.archives-ouvertes.fr/hal-01331714
Contributor: Dominique Fohr
Submitted on: Thursday, 20 October 2016 - 09:56:02
Last modified on: Tuesday, 18 December 2018 - 16:38:02

File

draft_7Mar2016.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01331714, version 1

Citation

Imran Sheikh, Irina Illina, Dominique Fohr. How Diachronic Text Corpora Affect Context based Retrieval of OOV Proper Names for Audio News. LREC 2016, May 2016, Portoroz, Slovenia. Proceedings of LREC 2016. 〈hal-01331714〉
