Using the Web for fast language model construction in minority languages

Abstract : Designing and building a language model for a minority language is a hard task. By minority language, we mean a language with few available resources, especially for statistical learning. This paper proposes a new methodology for fast language model construction in minority languages. It relies on collecting Web resources and turning them into usable textual corpora. With efficient filtering techniques, the methodology allows a language model to be built quickly, at a small cost in terms of computational and human resources. Our initial experiments on a majority language (French) showed excellent performance of Web language models built with the proposed filtering methods compared with newspaper language models. Following the same approach for a minority language (Vietnamese), a valuable language model was constructed in 3 months, with only 15% new development needed to adapt some of the filtering tools.

https://hal.archives-ouvertes.fr/hal-01392377
Contributor : Brigitte Bigi
Submitted on : Friday, November 4, 2016 - 12:09:29 PM
Last modification on : Tuesday, July 9, 2019 - 1:26:58 AM
Long-term archiving on : Sunday, February 5, 2017 - 1:46:04 PM

File

Files produced by the author(s)

Identifiers

  • HAL Id : hal-01392377, version 1

Citation

Viet-Bac Le, Brigitte Bigi, Laurent Besacier, Eric Castelli. Using the Web for fast language model construction in minority languages. Eurospeech, 2003, Geneva, Switzerland. pp. 3117-3120. ⟨hal-01392377⟩
