Journal articles

Modeling Arabic Language using statistical methods

Karima Meftouh 1 Med Tayeb Laskri 1 Kamel Smaïli 2
2 PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract : In this paper we investigate statistical language models for Arabic. First, several experiments using different smoothing techniques are carried out on a small corpus extracted from a daily newspaper. The sparseness of the data leads us to investigate other solutions that do not require increasing the size of the corpus. A word segmentation technique is employed to increase the statistical viability of the corpus, and an n-morpheme model is developed that performs better in terms of normalized perplexity. The second experiment studies the behaviour of statistical models based on different kinds of corpora; introducing distant n-grams improves the baseline model. Finally, we present a comparative study of statistical language models for Arabic and several foreign languages. The objective of this study is to understand how best to model each of these languages. For the foreign languages, trigram models are the most appropriate whatever the smoothing technique used. For Arabic, higher-order n-gram models smoothed with the Witten-Bell method are more efficient.
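To make the abstract's terms concrete, the following is a minimal sketch of a bigram model with interpolated Witten-Bell smoothing and a perplexity computation. It is not the authors' implementation: the function names, the add-one unigram fallback, and the toy corpus are assumptions chosen for illustration, and the paper's normalized perplexity for morpheme-based models is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Collect unigram and bigram counts from tokenized sentences."""
    unigrams = Counter()
    followers = defaultdict(Counter)  # followers[h][w] = count of bigram (h, w)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[1:])  # <s> is never predicted, so it is not counted
        for h, w in zip(padded, padded[1:]):
            followers[h][w] += 1
    return unigrams, followers


def unigram_prob(w, unigrams):
    """Add-one unigram estimate (an assumption of this sketch).

    One extra vocabulary slot is reserved for unseen words.
    """
    total = sum(unigrams.values())
    vocab = len(unigrams) + 1
    return (unigrams.get(w, 0) + 1) / (total + vocab)


def witten_bell_prob(w, h, unigrams, followers):
    """Interpolated Witten-Bell estimate of P(w | h)."""
    seen = followers.get(h)
    if not seen:
        # History never observed: back off entirely to the unigram estimate.
        return unigram_prob(w, unigrams)
    c_h = sum(seen.values())  # number of tokens observed after h
    t_h = len(seen)           # number of distinct types observed after h
    return (seen.get(w, 0) + t_h * unigram_prob(w, unigrams)) / (c_h + t_h)


def perplexity(sentences, unigrams, followers):
    """Perplexity of the model on tokenized sentences (base-2 logs)."""
    log_sum, n = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for h, w in zip(padded, padded[1:]):
            log_sum += math.log2(witten_bell_prob(w, h, unigrams, followers))
            n += 1
    return 2 ** (-log_sum / n)


# Hypothetical toy corpus, for illustration only.
train = [["a", "b", "c"], ["a", "b", "d"]]
unigrams, followers = train_bigram(train)
print(perplexity([["a", "b", "c"]], unigrams, followers))
```

In Witten-Bell smoothing, the weight given to the lower-order estimate grows with the number of distinct words seen after a history, so histories that are followed by many different words reserve more probability mass for unseen continuations. Higher-order models apply the same interpolation recursively.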
Document type :
Journal articles
https://hal.inria.fr/inria-00582493
Contributor : Kamel Smaïli
Submitted on : Tuesday, November 14, 2017 - 12:18:33 PM
Last modification on : Friday, February 26, 2021 - 3:28:06 PM
Long-term archiving on : Thursday, February 15, 2018 - 3:01:37 PM

File

Karima-Arabian-Journal.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00582493, version 1

Citation

Karima Meftouh, Med Tayeb Laskri, Kamel Smaïli. Modeling Arabic Language using statistical methods. Arabian Journal for Science and Engineering, King Fahd University of Petroleum and Minerals SAUDI ARABIA - Springer (en ligne), 2010, Theme issue on Arabic Computing, 35 (2C), pp.69-82. ⟨inria-00582493⟩

Metrics

Record views

225

Files downloads

193