Information Theoretical and Statistical Features for Intrinsic Plagiarism Detection

Abstract: In this paper we present information-theoretic and statistical features, including function word skip n-grams, for intrinsic plagiarism detection. We train a binary classifier on different feature sets and compare their performance. Specifically, we propose a set of 36 features for classifying plagiarized and non-plagiarized text in suspicious documents. Our experiments show that the entropy, relative entropy, and correlation coefficient of function word skip n-gram frequency profiles are very effective features. The proposed feature set achieves an F-score of 85.10%.
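The three profile-based features named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function-word list, the skip window (`max_skip`), the additive smoothing (`alpha`), and the choice to compare a segment against the whole document are all assumptions made for illustration, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative function-word list (an assumption, not the paper's list).
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "it", "for"}


def skip_bigrams(text, max_skip=2):
    """Pairs of function words, skipping at most max_skip function words between them."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w in FUNCTION_WORDS]
    grams = []
    for i in range(len(words)):
        for j in range(i + 1, min(i + 2 + max_skip, len(words))):
            grams.append((words[i], words[j]))
    return grams


def profile(grams, vocab, alpha=1.0):
    """Smoothed relative-frequency profile over a shared vocabulary (alpha is assumed)."""
    counts = Counter(grams)
    total = len(grams) + alpha * len(vocab)
    return [(counts[g] + alpha) / total for g in vocab]


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; q must be strictly positive."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def segment_features(segment, document, max_skip=2):
    """Compare a segment's skip-bigram profile with the whole document's profile."""
    seg_grams = skip_bigrams(segment, max_skip)
    doc_grams = skip_bigrams(document, max_skip)
    vocab = sorted(set(seg_grams) | set(doc_grams))
    if not vocab:  # no function-word pairs at all
        return {"entropy": 0.0, "relative_entropy": 0.0, "correlation": 0.0}
    p = profile(seg_grams, vocab)
    q = profile(doc_grams, vocab)
    return {
        "entropy": entropy(p),
        "relative_entropy": relative_entropy(p, q),
        "correlation": pearson(p, q),
    }
```

In a setup like this, a segment whose profile diverges strongly from the document's profile (high relative entropy, low correlation) would be a candidate for a different author; the resulting values would then be passed to a binary classifier alongside other features.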
Document type:
Conference paper
16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Sep 2015, Prague, Czech Republic

https://hal.archives-ouvertes.fr/hal-01617333
Contributor: Rashedur Rahman
Submitted on: Monday, 16 October 2017 - 14:50:36
Last modified on: Tuesday, 20 November 2018 - 14:04:02

Identifiers

  • HAL Id: hal-01617333, version 1

Citation

Rashedur Rahman. Information Theoretical and Statistical Features for Intrinsic Plagiarism Detection. 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Sep 2015, Prague, Czech Republic. 〈hal-01617333〉
