Structural and Visual Comparisons for Web Page Archiving

Marc Teva Law (1), Nicolas Thome (1), Stéphane Gançarski (2), Matthieu Cord (1)
(1) MALIRE - Machine Learning and Information Retrieval, LIP6 - Laboratoire d'Informatique de Paris 6
(2) BD - Bases de Données, LIP6 - Laboratoire d'Informatique de Paris 6
Abstract : In this paper, we propose a Web page archiving system that combines state-of-the-art comparison methods based on the source code of Web pages with computer vision techniques. To detect whether successive versions of a Web page are similar, our system relies on: (1) a combination of structural and visual comparison methods embedded in a statistical discriminative model, (2) a visual similarity measure designed for Web pages that improves change detection, and (3) a supervised feature selection method adapted to Web archiving. We train a Support Vector Machine model on vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, represented by their vector of similarity scores, are similar. Experiments on real archives validate our approach.
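
The classification step described in the abstract (a Support Vector Machine trained on vectors of similarity scores between two versions of a page) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the two feature names, the toy scores and labels, the hyperparameters, and the choice of a linear SVM trained by plain batch subgradient descent on the hinge loss. The authors' actual features, kernel, and solver are not specified in this record.

```python
# Toy sketch: a linear SVM over vectors of similarity scores.
# All data, feature names, and hyperparameters are illustrative assumptions.

def score(w, b, x):
    """Signed decision value w.x + b for one similarity vector x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by batch subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * score(w, b, xi) < 1:  # point violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 (versions similar) or -1 (versions dissimilar)."""
    return 1 if score(w, b, x) >= 0 else -1

# Each row is [structural_similarity, visual_similarity] in [0, 1]
# for one pair of successive page versions (hypothetical features).
X = [
    [0.95, 0.90], [0.85, 0.80], [0.90, 0.95], [0.80, 0.90],  # labelled similar
    [0.20, 0.30], [0.10, 0.15], [0.30, 0.10], [0.25, 0.20],  # labelled dissimilar
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print(predict(w, b, [0.9, 0.9]), predict(w, b, [0.1, 0.1]))
```

In this sketch each pair of versions is reduced to a short vector of scores before classification, so further comparison methods would enter simply as additional vector components.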
Document type : Conference papers
Contributor : Lip6 Publications
Submitted on : Thursday, February 11, 2016 - 1:42:03 PM
Last modification on : Friday, January 8, 2021 - 5:34:11 PM



Marc Teva Law, Nicolas Thome, Stéphane Gançarski, Matthieu Cord. Structural and Visual Comparisons for Web Page Archiving. 12th edition of the ACM Symposium on Document Engineering, DocEng'12, Sep 2012, Paris, France. pp.117-120, ⟨10.1145/2361354.2361380⟩. ⟨hal-01272800⟩


