Conference paper. Year: 2016

Measuring the comparability of multilingual corpora extracted from Twitter and others

Abstract

Multilingual corpora are widely used in natural language processing tasks. They are principally of two sorts: comparable corpora and parallel corpora. Comparable corpora gather texts in several languages that deal with analogous subjects but, unlike the texts of parallel corpora, are not translations of each other. In this paper, we conduct a comparative study of two stemming techniques in order to improve a comparability measure based on a bilingual dictionary. The two methods are the Buckwalter Arabic Morphological Analyzer (BAMA) and a proposed approach based on Light Stemming (LS) adapted specifically to Twitter; we then combined them. We evaluated and compared these techniques on three English-Arabic corpora: a corpus extracted from the social network Twitter, a Euronews corpus, and a parallel corpus extracted from newspapers (ANN). The experimental results show that the best comparability measure is achieved by combining BAMA with LS, which yields a similarity of 61% for Twitter, 52% for Euronews and 65% for ANN. For a confidence of 40%, we aligned 73.8% of Arabic and English tweets.
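As a rough illustration of what a dictionary-based comparability measure can look like, the Python sketch below scores two tokenized texts by the share of dictionary-covered words that have a translation in the other text, averaged over both directions. This is an assumption made for illustration, not the measure defined in the paper. The names `light_stem`, `translated_fraction` and `comparability` are hypothetical, and `light_stem` is a toy normalizer standing in for BAMA or the paper's Light Stemming.

```python
from typing import Callable, Dict, Iterable, Set


def light_stem(token: str) -> str:
    """Toy normalizer (assumption): lowercase and strip leading '#' and '@'.

    A placeholder for a real stemmer; it is not the paper's Light Stemming.
    """
    return token.lower().lstrip("#@")


def translated_fraction(
    src_tokens: Iterable[str],
    tgt_tokens: Iterable[str],
    dictionary: Dict[str, Set[str]],
    normalize: Callable[[str], str] = light_stem,
) -> float:
    """Share of dictionary-covered source word types that have at least
    one listed translation present in the target text."""
    src_vocab = {normalize(t) for t in src_tokens}
    tgt_vocab = {normalize(t) for t in tgt_tokens}
    # Only words the dictionary knows about can be checked.
    covered = [w for w in src_vocab if w in dictionary]
    if not covered:
        return 0.0
    hits = sum(1 for w in covered if dictionary[w] & tgt_vocab)
    return hits / len(covered)


def comparability(
    src_tokens: Iterable[str],
    tgt_tokens: Iterable[str],
    src_to_tgt: Dict[str, Set[str]],
    tgt_to_src: Dict[str, Set[str]],
    normalize: Callable[[str], str] = light_stem,
) -> float:
    """Symmetric score in [0, 1]: mean of the two directional fractions."""
    src_list, tgt_list = list(src_tokens), list(tgt_tokens)
    forward = translated_fraction(src_list, tgt_list, src_to_tgt, normalize)
    backward = translated_fraction(tgt_list, src_list, tgt_to_src, normalize)
    return (forward + backward) / 2


# Tiny made-up example: an English tweet and an Arabic tweet.
english = ["#Book", "school", "rain"]
arabic = ["كتاب", "مدرسة"]
en_ar = {"book": {"كتاب"}, "school": {"مدرسة"}, "rain": {"مطر"}}
ar_en = {"كتاب": {"book"}, "مدرسة": {"school"}}
score = comparability(english, arabic, en_ar, ar_en)
```

In this sketch the dictionary keys are assumed to be already normalized. A stronger stemmer matters because inflected or hashtagged forms that fail to match a dictionary entry are simply ignored, which is why the choice of normalization changes the score.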
Main file
AbidiSmailiHrTAL2016.pdf (344.4 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01536076, version 1 (19-06-2017)
hal-01536076, version 2 (04-09-2017)

Identifiers

  • HAL Id: hal-01536076, version 2

Cite

Karima Abidi, Kamel Smaili. Measuring the comparability of multilingual corpora extracted from Twitter and others. HrTAL2016 - Tenth International Conference on Natural Language Processing, Sep 2016, Dubrovnik, Croatia. ⟨hal-01536076v2⟩