A Preliminary Study for Building an Arabic Corpus of Pair Questions-Texts from the Web: AQA-Webcorp

Wided Bakari; Patrice Bellot; Mahmoud Neji

doi:10.3991/ijes.v4i2.5345

Journal article. International Journal of Recent Contributions from Engineering, Science & IT (iJES). Year: 2016

Wided Bakari

Role: Author

Université de Sfax

Laboratoire des Sciences de l'Information et des Systèmes

Patrice Bellot

Role: Author
PersonId : 14204
IdHAL : patrice-bellot
ORCID : 0000-0001-8698-5055
IdRef : 079380956

Laboratoire des Sciences de l'Information et des Systèmes

Mahmoud Neji

Role: Author

Université de Sfax

Abstract

With the development of electronic media and the heterogeneity of Arabic data on the Web, building a clean corpus for natural language processing applications such as machine translation, information retrieval, and question answering has become increasingly pressing. In this manuscript, we seek to create and develop our own corpus of question-text pairs, which will provide a better basis for our experimentation step. We model this construction with a method for Arabic that retrieves texts from the Web that could prove to be answers to our factual questions. To do this, we developed a Java script that extracts a list of HTML pages for a given query. These pages are then cleaned to obtain a database of texts and a corpus of question-text pairs. In addition, we give preliminary results of our proposed method. Some investigations into the construction of Arabic corpora are also presented in this document.
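
The cleaning and pairing steps mentioned in the abstract can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the authors' script; the names `TextExtractor`, `clean_html`, and `build_pairs` are hypothetical, and the sketch assumes the HTML pages for a question have already been retrieved as strings.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(html):
    """Strip markup from one HTML page and normalize whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def build_pairs(question, pages):
    """Pair a question with the cleaned text of each non-empty page."""
    pairs = []
    for page in pages:
        text = clean_html(page)
        if text:
            pairs.append((question, text))
    return pairs
```

Calling `build_pairs(question, pages)` with one factual question and the pages retrieved for it yields a list of (question, text) tuples. Pages that contain no visible text after cleaning are dropped, which is an assumption of this sketch rather than a rule stated in the abstract.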

Keywords

Arabic corpus search engine Corpus building

Domains

Information Retrieval [cs.IR]; Document and Text Processing; Social and Information Networks [cs.SI]; Web

Main file

5345-19035-1-PB.pdf (1.41 MB)

Origin: Publisher files allowed on an open archive

Contributor: Patrice Bellot

https://hal.science/hal-01591539

Submitted on: Thursday, September 21, 2017, 15:06:52

Last modified on: Tuesday, December 5, 2023, 18:08:07

Dates and versions

hal-01591539 , version 1 (21-09-2017)

Identifiers

HAL Id: hal-01591539, version 1
arXiv: 1709.09404
DOI: 10.3991/ijes.v4i2.5345

Cite

Wided Bakari, Patrice Bellot, Mahmoud Neji. A Preliminary Study for Building an Arabic Corpus of Pair Questions-Texts from the Web: AQA-Webcorp. International Journal of Recent Contributions from Engineering, Science & IT (iJES), 2016, 4 (2), pp.38-45. ⟨10.3991/ijes.v4i2.5345⟩. ⟨hal-01591539⟩

Collections

UNIV-TLN CNRS UNIV-AMU LIS-LAB HESAM IRENAV LAMPA LCPI LABOMAP LISPEN MSMP

119 views

97 downloads
