Conference papers

Multiple Text Segmentation for Statistical Language Modeling

Abstract : In this article we address the text segmentation problem in statistical language modeling for under-resourced languages whose writing systems lack word boundary delimiters. While the lack of text resources has a negative impact on the performance of language models, the errors introduced by automatic word segmentation make those data even less usable. To better exploit the text resources, we propose a method based on weighted finite state transducers to estimate the N-gram language model from a training corpus in which each sentence is segmented in multiple ways instead of receiving a unique segmentation. Multiple segmentation generates more N-grams from the training corpus and makes it possible to obtain N-grams not found with a unique segmentation. We use this approach to train language models for automatic speech recognition systems for Khmer and Vietnamese, and the multiple segmentations lead to better performance than the unique segmentation approach.
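The claim that multiple segmentation yields N-grams absent from a unique segmentation can be illustrated with a small sketch. It is not the paper's method: the abstract describes weighted finite state transducers, whereas the code below enumerates segmentations by brute force and counts N-grams without weights. The toy Latin-letter lexicon, the function names `segmentations` and `ngram_counts`, and the choice of `("ab", "c")` as the output of a unique segmenter are all assumptions made for illustration.

```python
from collections import Counter


def segmentations(text, lexicon):
    """Return every split of `text` into words that all belong to `lexicon`."""
    max_len = max(len(word) for word in lexicon)
    results = []

    def extend(start, words):
        # A complete segmentation is found once the whole string is consumed.
        if start == len(text):
            results.append(tuple(words))
            return
        # Try every lexicon word that begins at position `start`.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            piece = text[start:end]
            if piece in lexicon:
                extend(end, words + [piece])

    extend(0, [])
    return results


def ngram_counts(segmented_sentences, n=2):
    """Count N-grams over segmented sentences (unweighted: an assumption)."""
    counts = Counter()
    for words in segmented_sentences:
        padded = ("<s>",) + tuple(words) + ("</s>",)
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


# Hypothetical toy lexicon and unsegmented input.
lexicon = {"a", "b", "c", "ab", "bc"}
all_segs = segmentations("abc", lexicon)

# Assumed output of a unique (single-best) segmenter for the same input.
unique_seg = [("ab", "c")]

unique_bigrams = ngram_counts(unique_seg)
multi_bigrams = ngram_counts(all_segs)

# Bigrams that only the multiple segmentation exposes.
new_bigrams = set(multi_bigrams) - set(unique_bigrams)
```

Here the unique segmentation is one of the three enumerated segmentations, so its bigrams are a subset of those obtained from multiple segmentation, and `new_bigrams` holds the additional ones. A real system would need to weight or prune segmentations, since the number of segmentations grows quickly with sentence length.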

Contributor : Brigitte Bigi
Submitted on : Monday, November 7, 2016 - 5:42:49 PM
Last modification on : Tuesday, November 24, 2020 - 4:20:03 PM
Long-term archiving on : Tuesday, March 14, 2017 - 11:36:28 PM


Publisher files allowed on an open archive


  • HAL Id : hal-01393605, version 1


Sopheap Seng, Laurent Besacier, Brigitte Bigi, Eric Castelli. Multiple Text Segmentation for Statistical Language Modeling. Interspeech, Sep 2009, Brighton, United Kingdom. pp.2663-2666. ⟨hal-01393605⟩


