Multiple Text Segmentation for Statistical Language Modeling

Abstract : This article addresses text segmentation in statistical language modeling for under-resourced languages whose writing systems have no word boundary delimiters. The lack of text resources already hurts language model performance, and the errors introduced by automatic word segmentation make the available data even less usable. To better exploit these text resources, we propose a method based on weighted finite-state transducers that estimates the N-gram language model from a training corpus in which each sentence is segmented in multiple ways rather than in a single way. Multiple segmentation generates more N-grams from the training corpus and recovers N-grams that a unique segmentation does not produce. We use this approach to train language models for automatic speech recognition systems for Khmer and Vietnamese, and the multiple segmentations lead to better performance than the unique segmentation approach.
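As a rough illustration of why segmenting a sentence in several ways yields more N-grams, the sketch below enumerates every vocabulary-consistent segmentation of an unsegmented string and collects the distinct N-grams across all of them. It is not the weighted finite-state transducer method of the paper: plain recursion stands in for the transducer, and the function names (`segmentations`, `distinct_ngrams`), the toy vocabulary, and the toy input are all assumptions made for the example.

```python
def segmentations(text, vocab):
    """Yield every way to split `text` into a sequence of words from `vocab`.

    Illustrative stand-in for a transducer-based segmenter (assumption):
    it tries every prefix that is a known word and recurses on the rest.
    """
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for rest in segmentations(text[end:], vocab):
                yield [head] + rest


def distinct_ngrams(sentences, vocab, n=2):
    """Collect the set of distinct n-grams over all segmentations of all sentences."""
    found = set()
    for sentence in sentences:
        for words in segmentations(sentence, vocab):
            for i in range(len(words) - n + 1):
                found.add(tuple(words[i:i + n]))
    return found


# Toy data (hypothetical): Latin letters stand in for an unsegmented script.
vocab = {"a", "b", "ab", "c", "bc"}
all_segs = list(segmentations("abc", vocab))
bigrams = distinct_ngrams(["abc"], vocab, n=2)
print(len(all_segs), "segmentations,", len(bigrams), "distinct bigrams")
```

On this toy input any single segmentation contributes at most two bigrams, while the three segmentations together contribute four, which is the effect the abstract describes. Exhaustive enumeration grows exponentially with sentence length, so a real system would need a compact lattice representation such as the transducers the authors use.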

https://hal.archives-ouvertes.fr/hal-01393605
Contributor : Brigitte Bigi
Submitted on : Monday, November 7, 2016 - 5:42:49 PM
Last modification on : Monday, July 8, 2019 - 3:10:05 PM
Long-term archiving on : Tuesday, March 14, 2017 - 11:36:28 PM

File

seng2009interspeech.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id : hal-01393605, version 1

Citation

Sopheap Seng, Laurent Besacier, Brigitte Bigi, Eric Castelli. Multiple Text Segmentation for Statistical Language Modeling. Interspeech, Sep 2009, Brighton, United Kingdom. pp.2663-2666. ⟨hal-01393605⟩
