Conference paper, Year: 2020

TArC : Incrementally and Semi-Automatically Collecting a Tunisian arabish Corpus

TArC. Un corpus d'arabish tunisien (TArC: a Tunisian arabish corpus)

Abstract

This article describes the collection process of the first morpho-syntactically annotated Tunisian arabish Corpus (TArC). Arabish is a spontaneous encoding of Arabic Dialects (AD) in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking social media users to facilitate communication on digital devices. Arabish differs from one Arabic dialect to another, and each arabish code-system is under-resourced. In the last few years, the attention NLP pays to AD has increased considerably. TArC will thus be a useful resource for different types of analyses, as well as for training NLP tools. In this article we describe preliminary work on the semi-automatic TArC construction process and some of the first analyses of the corpus. To provide a complete overview of the challenges faced during the building process, we present the main characteristics of the Tunisian dialect and its encoding in Tunisian arabish.
Main file
133.pdf (298.82 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-02784772 , version 1 (07-06-2020)
hal-02784772 , version 2 (18-06-2020)
hal-02784772 , version 3 (23-06-2020)

License

Attribution - NonCommercial - NoDerivatives

Identifiers

  • HAL Id : hal-02784772 , version 1

Cite

Elisa Gugliotta, Marco Dinarelli. TArC : Incrementally and Semi-Automatically Collecting a Tunisian arabish Corpus. 6e conférence conjointe Journées d'Études sur la Parole (JEP, 31e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition), Jun 2020, Nancy, France. pp.232-240. ⟨hal-02784772v1⟩
