Structural and Visual Comparisons for Web Page Archiving

Marc Teva Law (1), Nicolas Thome (1), Stéphane Gançarski (2), Matthieu Cord (1)
(1) MALIRE - Machine Learning and Information Retrieval, LIP6 - Laboratoire d'Informatique de Paris 6
(2) BD - Bases de Données, LIP6 - Laboratoire d'Informatique de Paris 6
Abstract : In this paper, we propose a Web page archiving system that combines state-of-the-art comparison methods based on the source code of Web pages with computer vision techniques. To detect whether successive versions of a Web page are similar, our system relies on: (1) a combination of structural and visual comparison methods embedded in a statistical discriminative model, (2) a visual similarity measure designed for Web pages that improves change detection, and (3) a supervised feature selection method adapted to Web archiving. We train a Support Vector Machine model on vectors of similarity scores computed between successive versions of pages. The trained model then determines whether two versions, described by their vector of similarity scores, are similar. Experiments on real archives validate our approach.
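The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two features (a structural score and a visual score), the training values, and all function names are hypothetical, and a plain batch subgradient descent on the regularized hinge loss stands in for a real SVM solver. It only shows the general idea of training a linear SVM on similarity-score vectors and then using the sign of the decision value to label a pair of versions as similar or not.

```python
# Hypothetical similarity-score vectors for pairs of successive page versions.
# Each vector is [structural_score, visual_score], both in [0, 1]; the feature
# names and values are made up for illustration and do not come from the paper.
TRAIN_VECTORS = [
    [0.90, 0.80], [0.80, 0.90], [0.85, 0.95], [0.95, 0.85],  # similar pairs
    [0.10, 0.20], [0.20, 0.10], [0.15, 0.05], [0.05, 0.15],  # dissimilar pairs
]
TRAIN_LABELS = [1, 1, 1, 1, -1, -1, -1, -1]  # +1 = similar, -1 = dissimilar


def decision_value(weights, bias, vector):
    """Signed score w . x + b; its sign gives the predicted class."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def train_linear_svm(vectors, labels, lam=0.01, lr=0.05, steps=2000):
    """Fit a linear soft-margin SVM by batch subgradient descent on
    lam/2 * ||w||^2 + mean(max(0, 1 - y * (w . x + b))).

    This is a simple stand-in for a dedicated SVM solver (an assumption
    of this sketch, not the method used in the paper).
    """
    n, dim = len(vectors), len(vectors[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w = [lam * w for w in weights]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            if y * decision_value(weights, bias, x) < 1:  # margin violated
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def is_similar(weights, bias, vector):
    """Classify a pair of versions from its similarity-score vector."""
    return decision_value(weights, bias, vector) > 0


weights, bias = train_linear_svm(TRAIN_VECTORS, TRAIN_LABELS)
print(is_similar(weights, bias, [0.90, 0.90]))  # high scores on both features
print(is_similar(weights, bias, [0.10, 0.10]))  # low scores on both features
```

In practice one would more likely rely on an established SVM library and on many more features per pair; the hand-written optimizer is used here only to keep the sketch dependency-free.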
Document type : Conference paper

https://hal.archives-ouvertes.fr/hal-01272800
Contributor : Lip6 Publications
Submitted on : Thursday, February 11, 2016 - 1:42:03 PM
Last modification on : Thursday, March 21, 2019 - 1:07:25 PM

Citation

Marc Teva Law, Nicolas Thome, Stéphane Gançarski, Matthieu Cord. Structural and Visual Comparisons for Web Page Archiving. 12th edition of the ACM Symposium on Document Engineering, DocEng'12, Sep 2012, Paris, France. pp.117-120, ⟨10.1145/2361354.2361380⟩. ⟨hal-01272800⟩