Journal articles

Splitting Arabic Texts into Elementary Discourse Units

Abstract: In this article, we present the first study of the feasibility of segmenting Arabic discourse into elementary discourse units (EDUs) within the Segmented Discourse Representation Theory (SDRT) framework. We first describe our annotation scheme, which defines a set of principles to guide the segmentation process. Two corpora have been annotated according to this scheme: elementary school textbooks and newspaper documents extracted from the syntactically annotated Arabic Treebank. We then propose a multiclass supervised learning approach that predicts nested units. Our approach combines punctuation, morphological, lexical, and shallow syntactic features, and we investigate how each feature type contributes to the learning process. We show that an extensive morphological analysis is crucial to achieving good results on both corpora. In addition, we show that adding chunks does not improve the performance of our system.
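To make the idea of predicting nested units with a per-token multiclass classifier more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the label set (`OPEN`, `CLOSE`, `NONE`), the feature names, and the helper functions `token_features` and `decode_units` are all hypothetical, and the one-character prefix is only a crude stand-in for the morphological analysis the abstract refers to. The sketch also does not handle a single token that both opens and closes a unit.

```python
import unicodedata

# Hypothetical label set (an assumption, not the paper's class inventory):
#   OPEN  : the token starts a unit
#   CLOSE : the token ends a unit
#   NONE  : no boundary at this token
OPEN, CLOSE, NONE = "OPEN", "CLOSE", "NONE"


def token_features(tokens, i):
    """Build surface features for tokens[i] that a classifier could consume."""
    tok = tokens[i]
    return {
        "token": tok,  # lexical feature
        # punctuation feature: every character is in a Unicode "P*" category
        "is_punct": bool(tok) and all(
            unicodedata.category(c).startswith("P") for c in tok
        ),
        # crude stand-in for clitic cues (e.g. an attached conjunction)
        "prefix1": tok[:1],
        "prev": tokens[i - 1] if i > 0 else "<s>",
        "next": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
    }


def decode_units(labels):
    """Convert per-token labels into inclusive (start, end) spans.

    A stack pairs each CLOSE with the most recent unmatched OPEN,
    which is what allows one unit to be nested inside another.
    """
    stack, spans = [], []
    for i, label in enumerate(labels):
        if label == OPEN:
            stack.append(i)
        elif label == CLOSE and stack:
            spans.append((stack.pop(), i))
    return sorted(spans)
```

In this sketch, a trained classifier would assign one label per token from its feature dictionary, and `decode_units` would then recover the bracketing; a label sequence such as `OPEN, NONE, OPEN, CLOSE, NONE, CLOSE` yields an outer unit containing an embedded one.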

https://hal.archives-ouvertes.fr/hal-01120621
Contributor: Open Archive Toulouse Archive Ouverte (oatao)
Submitted on: Tuesday, March 3, 2015 - 10:04:01 AM
Last modification on: Wednesday, June 9, 2021 - 10:00:28 AM
Long-term archiving on: Sunday, April 16, 2017 - 11:24:17 AM

File

Keskes_12992.pdf
Files produced by the author(s)

Citation

Iskander Keskes, Farah Benamara, Lamia Hadrich Belguith. Splitting Arabic Texts into Elementary Discourse Units. ACM Transactions on Asian Language Information Processing, Association for Computing Machinery, 2014, vol. 13 (n° 2), pp. 1-23. ⟨10.1145/2601401⟩. ⟨hal-01120621⟩
