Measuring the comparability of multilingual corpora extracted from Twitter and others

Abidi Karima; Kamel Smaili

Conference paper. Year: 2016



Abidi Karima

Role: Author

Statistical Machine Translation and Speech Modelization and Text

École Nationale Supérieure d'Informatique [Alger]

Kamel Smaili

Role: Author
PersonId : 2521
IdHAL : kamel-smaili
IdRef : 034429700

Statistical Machine Translation and Speech Modelization and Text

Abstract

Multilingual corpora are widely used in natural language processing tasks. They are principally of two sorts: comparable and parallel. Comparable corpora gather texts in several languages that deal with analogous subjects but, unlike the texts in parallel corpora, are not translations of each other. In this paper, we compare two stemming techniques in order to improve a comparability measure based on a bilingual dictionary. The two techniques are the Buckwalter Arabic Morphological Analyzer (BAMA) and a proposed approach based on Light Stemming (LS) adapted specifically to Twitter; we also evaluate their combination. We evaluated and compared these techniques on three English-Arabic corpora: a corpus extracted from the social network Twitter, a Euronews corpus, and a parallel corpus extracted from newspapers (ANN). The experimental results show that the best comparability measure is achieved by combining BAMA with LS, which leads to a similarity of 61% for Twitter, 52% for Euronews and 65% for ANN. For a confidence of 40%, we aligned 73.8% of Arabic and English tweets.
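The abstract does not give the formula of the dictionary-based comparability measure, so the following is only a minimal sketch of one plausible formulation: the share of vocabulary items on each side that have at least one dictionary translation present on the other side. The function names `comparability` and `is_aligned`, the `stem` parameter, the toy transliterated dictionary entries, and the reading of the 40% figure as a score threshold are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, Iterable, Set


def comparability(
    src_tokens: Iterable[str],
    tgt_tokens: Iterable[str],
    src_to_tgt: Dict[str, Set[str]],
    tgt_to_src: Dict[str, Set[str]],
    stem: Callable[[str], str] = lambda t: t,
) -> float:
    """Hypothetical symmetric dictionary-overlap score in [0, 1]."""
    # Vocabulary of each side after optional stemming (a stand-in for
    # BAMA, light stemming, or their combination; assumed interface).
    src_vocab = {stem(t) for t in src_tokens}
    tgt_vocab = {stem(t) for t in tgt_tokens}
    if not src_vocab and not tgt_vocab:
        return 0.0
    # A word counts as matched when any of its dictionary translations
    # occurs in the vocabulary of the other side.
    src_hits = sum(1 for w in src_vocab if src_to_tgt.get(w, set()) & tgt_vocab)
    tgt_hits = sum(1 for w in tgt_vocab if tgt_to_src.get(w, set()) & src_vocab)
    return (src_hits + tgt_hits) / (len(src_vocab) + len(tgt_vocab))


def is_aligned(score: float, threshold: float = 0.4) -> bool:
    """Assumed decision rule: keep a pair whose score reaches the threshold."""
    return score >= threshold


# Toy data: transliterated Arabic placeholders, not taken from the paper.
ar_en = {"kitab": {"book"}, "madrasa": {"school"}, "jadid": {"new"}}
en_ar = {"book": {"kitab"}, "school": {"madrasa"}, "football": {"kura"}}

score = comparability(
    ["kitab", "madrasa", "jadid"],
    ["book", "school", "football"],
    ar_en,
    en_ar,
)
print(round(score, 2))    # 4 of 6 vocabulary items are matched -> 0.67
print(is_aligned(score))  # True under the assumed 0.4 threshold
```

Passing a stemmer through `stem` shows where normalization would matter in such a measure: inflected forms can only match dictionary entries if both sides are reduced to the same form before lookup.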

Keywords

Stemming, comparability measure, Twitter

Domains

Computation and Language [cs.CL]

Main file

AbidiSmailiHrTAL2016.pdf (344.4 KB)

Origin: Files produced by the author(s)

Contributor: Kamel Smaïli

https://hal.science/hal-01536076

Submitted on: Monday, 4 September 2017, 14:35:09

Last modified on: Monday, 11 September 2023, 17:41:19

Dates and versions

hal-01536076 , version 1 (19-06-2017)

hal-01536076 , version 2 (04-09-2017)

Identifiers

HAL Id : hal-01536076 , version 2

Cite

Abidi Karima, Kamel Smaili. Measuring the comparability of multilingual corpora extracted from Twitter and others. HrTAL2016 - Tenth International Conference on Natural Language Processing, Sep 2016, Dubrovnik, Croatia. ⟨hal-01536076v2⟩


Collections

CNRS INRIA UNIV-LORRAINE LORIA LORIA-NLPKD

