Structural and Visual Similarity Learning for Web Page Archiving

Marc Teva Law (1), Carlos Sureda Gutierrez (1), Nicolas Thome (1), Stéphane Gançarski (2), Matthieu Cord (1)
(1) MALIRE - Machine Learning and Information Retrieval, LIP6 - Laboratoire d'Informatique de Paris 6
(2) BD - Bases de Données, LIP6 - Laboratoire d'Informatique de Paris 6
Abstract : This paper presents a Web page archiving approach that combines image and structural techniques. Our main goal is to learn a similarity measure between Web pages in order to detect whether successive versions of a page are similar. The system is based on a visual similarity measure designed for Web pages. We combine it with a structural analysis of Web page source code and propose a supervised feature selection method adapted to Web archiving. We report experiments on real Web archives and address scalability issues.
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-01271765
Contributor : Lip6 Publications
Submitted on : Tuesday, February 9, 2016 - 3:52:51 PM
Last modification on : Thursday, March 21, 2019 - 2:34:09 PM

Citation

Marc Teva Law, Carlos Sureda Gutierrez, Nicolas Thome, Stéphane Gançarski, Matthieu Cord. Structural and Visual Similarity Learning for Web Page Archiving. 10th workshop on Content-Based Multimedia Indexing (CBMI), Jun 2012, Annecy, France. pp.1-6, ⟨10.1109/CBMI.2012.6269849⟩. ⟨hal-01271765⟩
