A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

Abstract: Most speech and language technologies are trained on massive amounts of speech and text data. However, most of the world's languages lack such resources, and some even lack a stable orthography. Building systems under these almost zero-resource conditions is promising not only for speech technology but also for computational language documentation, whose goal is to help field linguists (semi-)automatically analyze and annotate audio recordings of endangered, unwritten languages. Example tasks are automatic phoneme discovery and lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It comprises 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language phonology. We detail how the data was collected, cleaned and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.
Document type:
Conference paper
Language Resources and Evaluation Conference (LREC), May 2018, Miyazaki, Japan

https://hal.archives-ouvertes.fr/hal-01807093
Contributor: Laurent Besacier
Submitted on: Monday, June 4, 2018 - 14:02:31
Last modified on: Tuesday, November 20, 2018 - 14:04:02
Document(s) archived on: Wednesday, September 5, 2018 - 14:21:37

File

lrec2018_mboshi_final-3.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01807093, version 1

Citation

P. Godard, G. Adda, M. Adda-Decker, J. Benjumea, L. Besacier, et al. A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments. Language Resources and Evaluation Conference (LREC), May 2018, Miyazaki, Japan. 〈hal-01807093〉
