TArC : Incrementally and Semi-Automatically Collecting a Tunisian arabish Corpus

Elisa Gugliotta; Marco Dinarelli

Conference paper, Year: 2020

TArC. Un corpus d'arabish tunisien


Elisa Gugliotta

Role: Author
PersonId : 1150987
ORCID : 0000-0002-9504-0480
IdRef : 263982599

Università degli Studi di Roma "La Sapienza" = Sapienza University [Rome]

Laboratoire d'Informatique de Grenoble

Marco Dinarelli

Role: Author
PersonId : 12699
IdHAL : marco-dinarelli
IdRef : 22461939X

Laboratoire d'Informatique de Grenoble

Abstract

This article describes the collection process of the first morpho-syntactically annotated Tunisian arabish Corpus (TArC). Arabish is a spontaneous coding of Arabic Dialects (AD) in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking social media users to facilitate communication on digital devices. Arabish differs for each Arabic dialect, and each arabish code-system is under-resourced. In the last few years, the attention NLP pays to AD has increased considerably. TArC will thus be a useful resource for different types of analyses, as well as for training NLP tools. In this article we describe preliminary work on the semi-automatic construction of TArC and some of the first analyses of the corpus. To give a complete overview of the challenges faced during the building process, we present the main characteristics of the Tunisian dialect and its encoding in Tunisian arabish.

Keywords

Tunisian arabish Corpus; Arabic Dialect; Arabizi

Domains

Computation and Language [cs.CL]

Main file

133.pdf (298.82 KB)

Origin: Publisher files allowed on an open archive

Contributor: Sylvain Pogodalla

https://hal.science/hal-02784772

Submitted on: Sunday, June 7, 2020, 20:56:43

Last modified on: Friday, November 17, 2023, 15:37:44

Dates and versions

hal-02784772 , version 1 (07-06-2020)

hal-02784772 , version 2 (18-06-2020)

hal-02784772 , version 3 (23-06-2020)

License

Attribution - NonCommercial - NoDerivatives

Identifiers

HAL Id : hal-02784772 , version 1

Cite

Elisa Gugliotta, Marco Dinarelli. TArC : Incrementally and Semi-Automatically Collecting a Tunisian arabish Corpus. 6e conférence conjointe Journées d'Études sur la Parole (JEP, 31e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition), Jun 2020, Nancy, France. pp.232-240. ⟨hal-02784772v1⟩


