Abstract: This paper presents information-theoretic and statistical features, including function word skip n-grams, for intrinsic plagiarism detection. We train a binary classifier on different feature sets and compare their performance. In total, we propose a set of 36 features for classifying texts in suspicious documents as plagiarized or non-plagiarized. Our experiments show that the entropy, relative entropy, and correlation coefficient of function word skip n-gram frequency profiles are highly effective features. The proposed feature set achieves an F-score of 85.10%.
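To make the three profile-based features concrete, the following is a minimal sketch of how entropy, relative entropy, and a correlation coefficient might be computed from function word skip bigram frequency profiles. The abstract does not specify these details, so everything here is an assumption for illustration: the tiny `FUNCTION_WORDS` list, the `max_skip` window, the add-alpha smoothing, the use of Pearson correlation, and the comparison of a segment's profile against the whole document's profile.

```python
import math
from collections import Counter

# Hypothetical, deliberately small function-word list (an assumption;
# the paper's actual list is not given in the abstract).
FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to", "is", "that", "it", "on"}


def skip_bigrams(text, max_skip=2):
    """Pair each function word with the next ones, skipping up to max_skip of them."""
    words = [w for w in text.lower().split() if w in FUNCTION_WORDS]
    grams = []
    for i, first in enumerate(words):
        for j in range(i + 1, min(i + 2 + max_skip, len(words))):
            grams.append((first, words[j]))
    return grams


def profile(grams, vocab, alpha=1.0):
    """Add-alpha smoothed relative frequencies over a shared vocabulary."""
    counts = Counter(grams)
    total = len(grams) + alpha * len(vocab)
    return [(counts[g] + alpha) / total for g in vocab]


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; q must be non-zero where p is."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


def pearson(p, q):
    """Pearson correlation coefficient; returns 0.0 if either profile is constant."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((x - mean_p) * (y - mean_q) for x, y in zip(p, q))
    sd_p = math.sqrt(sum((x - mean_p) ** 2 for x in p))
    sd_q = math.sqrt(sum((y - mean_q) ** 2 for y in q))
    return cov / (sd_p * sd_q) if sd_p and sd_q else 0.0


def segment_features(segment, document, max_skip=2):
    """Return (entropy, relative entropy, correlation) of a segment vs. its document."""
    seg_grams = skip_bigrams(segment, max_skip)
    doc_grams = skip_bigrams(document, max_skip)
    vocab = sorted(set(seg_grams) | set(doc_grams))
    if not vocab:
        return 0.0, 0.0, 0.0
    p = profile(seg_grams, vocab)
    q = profile(doc_grams, vocab)
    return entropy(p), relative_entropy(p, q), pearson(p, q)
```

In a setup like this, the three values returned for each segment would form part of the feature vector passed to the binary classifier. A segment whose profile diverges strongly from the document's (high relative entropy, low correlation) would be a candidate for a different writing style.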
Rashedur Rahman. Information Theoretical and Statistical Features for Intrinsic Plagiarism Detection. 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Sep 2015, Prague, Czech Republic. ⟨hal-01617333⟩