Reuse and Plagiarism in Speech and Natural Language Processing

Joseph J Mariani; Gil Francopoulo; Patrick Paroubek

Journal Article, International Journal on Digital Libraries, Year: 2017

Joseph J Mariani

Role: Author
PersonId : 20614
IdHAL : joseph-mariani
ORCID : 0000-0001-7488-293X
IdRef : 066980542

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Gil Francopoulo

Role: Author
PersonId : 1034584

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Patrick Paroubek

Role: Author
PersonId : 20704
IdHAL : patrick-paroubek
ORCID : 0000-0002-4302-1894
IdRef : 057218730

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Abstract

This experiment presents a simple way to compare text fragments in order to detect (supposed) copy-and-paste operations between articles in the domain of Natural Language Processing (NLP), including Speech Processing. The comparisons are run over a corpus called NLP4NLP, which covers 34 conferences and journals and gathers a large part of NLP activity over the past 50 years. The study considers the similarity between the papers of each individual event and the complete set of papers in the whole corpus, according to four types of relationship (self-reuse, self-plagiarism, reuse and plagiarism) and in both directions: a paper borrowing a fragment of text from another paper of the corpus (which we call the source paper) or, in the reverse direction, fragments of text from the source paper being borrowed and inserted in another paper of the corpus. The results show that self-reuse is fairly common practice, whereas plagiarism appears to be very unusual, and that both stay within legal and ethical limits.
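The fragment comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption rather than the authors' implementation: the abstract does not say how fragments are compared, so the sketch uses shared word n-grams, and the window size `n=7`, the function names and the labelling rule (same author or not, source cited or not) are hypothetical choices made for illustration only.

```python
import re


def word_ngrams(text, n=7):
    """Return the set of lowercase word n-grams in text (n is a hypothetical default)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(borrower, source, n=7):
    """Fraction of the borrower's n-grams that also occur in the source.

    The measure is directional: swapping the arguments gives the
    reverse direction mentioned in the abstract.
    """
    borrowed = word_ngrams(borrower, n)
    if not borrowed:
        return 0.0
    return len(borrowed & word_ngrams(source, n)) / len(borrowed)


def relationship(shares_author, cites_source):
    """Label a borrowing pair.

    Assumption: the four types are split by author overlap and by
    whether the source is cited. The abstract does not define them.
    """
    if shares_author:
        return "self-reuse" if cites_source else "self-plagiarism"
    return "reuse" if cites_source else "plagiarism"
```

In such a sketch, a pair of papers would only be passed to `relationship` once `overlap_ratio` exceeds some threshold, which would itself have to be chosen empirically.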

Keywords

Plagiarism Detection; Text reuse; Natural Language Processing; Speech Processing; Scientometrics; Informetrics

Domains

Computer Science [cs]; Computation and Language [cs.CL]

https://hal.science/hal-01840700

Submitted on: Monday, 16 July 2018, 15:44:52

Last modified on: Saturday, 7 October 2023, 21:36:20

Dates and versions

hal-01840700, version 1 (16-07-2018)

Identifiers

HAL Id: hal-01840700, version 1

Cite

Joseph J Mariani, Gil Francopoulo, Patrick Paroubek. Reuse and Plagiarism in Speech and Natural Language Processing. International Journal on Digital Libraries, 2017, 18, pp.1-14. ⟨hal-01840700⟩

Collections

CNRS LIMSI UNIV-PARIS-SACLAY SORBONNE-UNIVERSITE LISN GS-ENGINEERING GS-COMPUTER-SCIENCE
