A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection

Abstract : In this paper we describe our effort to create a dataset for the evaluation of cross-language textual similarity detection. We present pre-existing corpora and their limitations, and we explain the resources we gathered to overcome these limitations and build our enriched dataset. The proposed dataset is multilingual, includes cross-language alignment at different granularities (from chunk to document), is based on both parallel and comparable corpora, and contains both human-translated and machine-translated texts. Moreover, it includes texts written by multiple types of authors (from average writers to professionals). With the resulting dataset, we conduct a systematic and rigorous evaluation of several state-of-the-art cross-language textual similarity detection methods. The evaluation results are reviewed and discussed. Finally, the dataset and scripts are made publicly available on GitHub: http://github.com/FerreroJeremy/Cross-Language-Dataset.
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-01303135
Contributor : Jérémy Ferrero
Submitted on : Friday, April 15, 2016 - 11:40:53 PM
Last modification on : Tuesday, February 12, 2019 - 1:30:55 AM
Document(s) archived on : Tuesday, November 15, 2016 - 4:40:23 AM

File

xample.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01303135, version 1

Citation

Jérémy Ferrero, Frédéric Agnès, Laurent Besacier, Didier Schwab. A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection. 10th edition of the Language Resources and Evaluation Conference, May 2016, Portorož, Slovenia. ⟨hal-01303135⟩

Metrics

Record views : 341
File downloads : 441