The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres

The CoMeRe project aims to build a kernel corpus of different Computer-Mediated Communication (CMC) genres in which French is the main language of interaction. It assembles interactions stemming from networks such as the Internet and telecommunication networks, covering monomodal and multimodal as well as synchronous and asynchronous communication. Corpora are assembled in a standard form, using the TEI (Text Encoding Initiative) format. This implies extending the TEI model of text, through a European endeavor, to encompass the richest and most complex CMC genres. This paper presents the Interaction Space model. We explain how this model has been encoded within the TEI corpus header and body by means of a new post element applied to textual messages and turns. The model is then instantiated through four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. The wish to guarantee generic annotations led us not to consider any processing beyond morphosyntactic labelling, while prioritizing the automatic annotation of any degraded elements within the corpora. We then turn to the decisions made concerning which annotations to apply to which units and describe the processing pipeline for adding them. All CoMeRe corpora, as well as their annotations, are verified through a staged quality control process designed to allow corpora to move from one project phase to the next.
Public release of the CoMeRe corpora is a short-term goal: corpora will be integrated into the forthcoming French National Reference Corpus and disseminated through the national linguistic infrastructure ORTOLANG. We therefore highlight issues and decisions concerning the acknowledgement of individual researchers' work in both the metadata and the corpus reference, as well as appropriate licenses compliant with the OpenData perspective. The conclusion refers to short-term challenges with respect to NLP annotations and to new collections of controversial Wikipedia talk pages and tweets that will be added to the CoMeRe databank.
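
As a rough illustration of what encoding textual messages with a post element might look like, the following minimal Python sketch builds a small TEI-like body. It is not the CoMeRe schema: the attribute names (xml:id, who, when, type), the wrapping div, the helper names make_post and build_body, and the sample messages are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URI for xml:id; ElementTree serializes it with the "xml" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def make_post(post_id, who, when, text, genre="chat"):
    """Build one hypothetical <post> element for a textual message or turn.

    The attribute set is an assumption for illustration, not the
    schema defined by the CoMeRe project.
    """
    post = ET.Element("post")
    post.set(f"{{{XML_NS}}}id", post_id)
    post.set("who", who)      # pointer to a participant declared elsewhere
    post.set("when", when)    # ISO 8601 timestamp of the message
    post.set("type", genre)   # CMC genre label, e.g. "chat" or "sms"
    paragraph = ET.SubElement(post, "p")
    paragraph.text = text
    return post


def build_body(messages):
    """Wrap a list of message dicts into <body><div>...</div></body>."""
    body = ET.Element("body")
    div = ET.SubElement(body, "div")
    for message in messages:
        div.append(make_post(**message))
    return body


# Invented sample data, used only to exercise the sketch.
messages = [
    {"post_id": "post-1", "who": "#participant-A",
     "when": "2014-01-01T10:00:00", "text": "bonjour à tous", "genre": "chat"},
    {"post_id": "post-2", "who": "#participant-B",
     "when": "2014-01-01T10:00:07", "text": "slt", "genre": "chat"},
]

body = build_body(messages)
xml_text = ET.tostring(body, encoding="unicode")
print(xml_text)
```

Keeping each message as a separate element with its own author and timestamp attributes is what would let later annotation steps address individual posts; the actual element content and attributes are those defined in the paper itself.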

Keywords

Computer Mediated Communication, CMC, CoMeRe corpus

Domains

Linguistics

Main file

cmr-article-jlcl-v140111.pdf (514.69 KB)

Origin: Files produced by the author(s)

Contributor: Thierry Chanier

https://shs.hal.science/halshs-00953507

Submitted on: Friday, 28 February 2014, 12:02:50

Last modified on: Thursday, 11 May 2023, 11:56:10

Long-term archiving on: Friday, 30 May 2014, 15:20:23

Dates and versions

halshs-00953507, version 1 (28-02-2014)

halshs-00953507, version 2 (12-09-2014)

Identifiers

HAL Id: halshs-00953507, version 1

Cite

Thierry Chanier, Céline Poudat, Benoit Sagot, Georges Antoniadis, Ciara R. Wigham, et al. The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres. 2014. ⟨halshs-00953507v1⟩
