
Syntax tree fingerprinting: a foundation for source code similarity detection

Abstract : Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits. Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process. In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t. source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.

Cited literature [36 references]
Contributor : Etienne Duris
Submitted on : Thursday, September 29, 2011 - 4:07:23 PM
Last modification on : Thursday, September 29, 2022 - 2:21:15 PM
Long-term archiving on : Tuesday, November 13, 2012 - 2:50:40 PM




  • HAL Id : hal-00627811, version 1


Michel Chilowicz, Étienne Duris, Gilles Roussel. Syntax tree fingerprinting: a foundation for source code similarity detection. 2009. ⟨hal-00627811⟩


