Multiple Text Segmentation for Statistical Language Modeling - HAL open archive
Conference paper, Year: 2009

Multiple Text Segmentation for Statistical Language Modeling

Abstract

In this article, we address the text segmentation problem in statistical language modeling for under-resourced languages whose writing systems lack word boundary delimiters. While the lack of text resources has a negative impact on the performance of language models, the errors introduced by automatic word segmentation make those data even less usable. To better exploit the text resources, we propose a method based on weighted finite-state transducers to estimate the N-gram language model from a training corpus in which each sentence is segmented in multiple ways instead of a single, unique segmentation. Multiple segmentation generates more N-grams from the training corpus and yields N-grams that are not found with unique segmentation. We use this approach to train language models for automatic speech recognition systems for Khmer and Vietnamese, and the multiple segmentations lead to better performance than the unique segmentation approach.
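The effect described in the abstract, that segmenting each sentence in several ways exposes N-grams a single segmentation misses, can be illustrated with a small sketch. This is not the authors' weighted finite-state transducer implementation: it is a plain-Python enumeration over a toy lexicon, and the function names, the toy lexicon, and the uniform weighting of segmentations are all assumptions made only for illustration.

```python
from collections import Counter


def segmentations(text, lexicon):
    """Return every split of `text` into words that all appear in `lexicon`."""
    max_len = max(map(len, lexicon))  # assumes a non-empty lexicon
    results = []

    def extend(start, words):
        if start == len(text):
            results.append(list(words))
            return
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            piece = text[start:end]
            if piece in lexicon:
                words.append(piece)
                extend(end, words)
                words.pop()

    extend(0, [])
    return results


def ngram_counts(segmented_sentences, n=2, weights=None):
    """Accumulate (possibly fractional) N-gram counts over segmented sentences."""
    counts = Counter()
    for i, words in enumerate(segmented_sentences):
        weight = 1.0 if weights is None else weights[i]
        padded = ["<s>"] * (n - 1) + words + ["</s>"]
        for j in range(len(padded) - n + 1):
            counts[tuple(padded[j:j + n])] += weight
    return counts


# Toy example (hypothetical lexicon, Latin letters standing in for characters).
lexicon = {"a", "b", "ab", "c"}
all_segs = segmentations("abc", lexicon)

# Assumption: every segmentation of a sentence gets the same weight.
uniform = [1 / len(all_segs)] * len(all_segs)
multi = ngram_counts(all_segs, n=2, weights=uniform)

# A single segmentation, as a longest-match segmenter would produce here.
unique = ngram_counts([["ab", "c"]], n=2)

print(len(unique), len(multi))  # prints "3 6"
```

In this toy case the unique segmentation yields three distinct bigrams, while the two possible segmentations together yield six, including ("a", "b") and ("b", "c"), which the unique segmentation never produces. Exhaustive enumeration grows quickly with sentence length, which is one reason a transducer-based formulation like the one the paper proposes is attractive in practice.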
Main file
seng2009interspeech.pdf (511.03 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-01393605, version 1 (07-11-2016)

Identifiers

  • HAL Id: hal-01393605, version 1

Cite

Sopheap Seng, Laurent Besacier, Brigitte Bigi, Eric Castelli. Multiple Text Segmentation for Statistical Language Modeling. Interspeech, Sep 2009, Brighton, United Kingdom. pp. 2663-2666. ⟨hal-01393605⟩