A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection

Abstract: This paper describes our effort to create a dataset for evaluating cross-language textual similarity detection. We present pre-existing corpora and their limitations, and explain the resources we gathered to overcome these limitations and build our enriched dataset. The proposed dataset is multilingual, includes cross-language alignment at several granularities (from chunk to document), is based on both parallel and comparable corpora, and contains both human- and machine-translated texts. Moreover, it includes texts written by multiple types of authors (from average writers to professionals). Using this dataset, we conduct a systematic and rigorous evaluation of several state-of-the-art cross-language textual similarity detection methods, and we review and discuss the results. Finally, the dataset and scripts are made publicly available on GitHub: http://github.com/FerreroJeremy/Cross-Language-Dataset.
Document type:
Conference paper
10th edition of the Language Resources and Evaluation Conference, May 2016, Portorož, Slovenia. 〈http://lrec2016.lrec-conf.org/en/〉

Cited literature [38 references]

https://hal.archives-ouvertes.fr/hal-01303135
Contributor: Jérémy Ferrero
Submitted on: Friday, April 15, 2016 - 23:40:53
Last modified on: Thursday, October 11, 2018 - 08:48:03
Document(s) archived on: Tuesday, November 15, 2016 - 04:40:23

File

xample.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01303135, version 1

Citation

Jérémy Ferrero, Frédéric Agnès, Laurent Besacier, Didier Schwab. A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection. 10th edition of the Language Resources and Evaluation Conference, May 2016, Portorož, Slovenia. 〈http://lrec2016.lrec-conf.org/en/〉. 〈hal-01303135〉

Metrics

Record views

329

File downloads

389