Using closely-related language to build an ASR for a very under-resourced language: Iban

Abstract : This paper describes our work on an automatic speech recognition (ASR) system for an under-resourced language, Iban, which is mainly spoken in Sarawak, Malaysia. Because no resources for ASR existed, we collected 8 hours of data to begin this study. We employed bootstrapping techniques involving a closely-related language to rapidly build and improve an Iban system. First, we used already available data from Malay, a locally dominant language in Malaysia, to bootstrap a grapheme-to-phoneme (G2P) system for the target language. We also built various types of G2Ps, including a grapheme-based one and an English G2P, to produce different versions of dictionaries. We tested all of the dictionaries on the Iban ASR to identify the best version. Second, we improved the word error rate (WER) of the baseline GMM system by using subspace Gaussian mixture models (SGMM). To test this, we set two levels of data sparseness on the Iban data: 7 hours and 1 hour of transcribed speech. We investigated cross-lingual SGMM, where the shared parameters were obtained in either monolingual or multilingual fashion and then applied to the target language for training. Experiments using out-of-language data, English and Malay, as source languages resulted in lower WERs when Iban data is very limited.
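
The abstract compares systems by word error rate (WER). As a reminder of how that metric is usually defined, here is a minimal sketch that computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the placeholder sentences are assumptions made for illustration; this is not the scoring tool used in the paper, which the abstract does not name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Hypothetical helper for illustration only; whitespace tokenization
    is an assumption, not something specified by the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not Iban data: one substitution and one deletion
# against a four-word reference.
wer = word_error_rate("w1 w2 w3 w4", "w1 wX w3")
print(f"WER = {wer:.2f}")
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.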
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-01055576
Contributor : Laurent Besacier
Submitted on : Wednesday, August 13, 2014 - 10:14:37 AM
Last modification on : Friday, February 15, 2019 - 6:15:25 PM
Document(s) archived on : Wednesday, November 26, 2014 - 11:50:32 PM

File

IS14full_paper-sarah_2.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01055576, version 1

Citation

Sarah Samson Juan, Laurent Besacier, Benjamin Lecouteux, Tan Tien Ping. Using closely-related language to build an ASR for a very under-resourced language: Iban. Oriental COCOSDA 2014, Sep 2014, Phuket, Thailand. 5 p. ⟨hal-01055576⟩
