
Block-o-Matic: A web page segmentation framework

Andrés Sanoja (1), Stéphane Gançarski (1)
(1) BD - Bases de Données, LIP6 - Laboratoire d'Informatique de Paris 6
Abstract : In this paper we describe Block-o-Matic, a web page segmentation framework. It is a hybrid approach inspired by automated document processing methods and visual-based content segmentation techniques. A web page is associated with three structures: the DOM tree, the content structure, and the logical structure. The DOM tree represents the HTML elements of a page; the content structure organizes page objects according to their content categories and geometry; and the logical structure results from mapping the content structure according to the human-perceptible meaning that makes up the blocks. The logical structure represents the final segmentation. The segmentation process is divided into three phases: analysis, understanding, and reconstruction of a web page. We propose an evaluation method for web page segmentations based on a ground truth of 400 pages classified into 16 categories. Block-o-Matic gives promising results.
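The three structures and three phases named in the abstract can be pictured with a small sketch. It is only an illustration under stated assumptions, not the authors' implementation: the class names, the tag-to-category table, the `(x, y, width, height)` geometry encoding, and the vertical-gap merging rule are all hypothetical choices made here to keep the example runnable.

```python
from dataclasses import dataclass, field

# Hypothetical geometry encoding: (x, y, width, height) in pixels.
Rect = tuple[int, int, int, int]

@dataclass
class DomNode:                      # stands in for the DOM tree
    tag: str
    rect: Rect
    children: list["DomNode"] = field(default_factory=list)

@dataclass
class ContentBlock:                 # stands in for the content structure
    category: str
    rect: Rect

@dataclass
class LogicalBlock:                 # stands in for the logical structure
    label: str
    members: list[ContentBlock]

# Assumed tag-to-category table; the abstract does not list categories.
CATEGORIES = {"p": "text", "h1": "text", "img": "media", "a": "navigation"}

def analysis(root: DomNode) -> list[ContentBlock]:
    """Phase 1: walk the DOM and emit one content block per categorised leaf."""
    if not root.children:
        category = CATEGORIES.get(root.tag)
        return [ContentBlock(category, root.rect)] if category else []
    blocks: list[ContentBlock] = []
    for child in root.children:
        blocks.extend(analysis(child))
    return blocks

def understanding(blocks: list[ContentBlock], max_gap: int = 20) -> list[LogicalBlock]:
    """Phase 2 (assumed rule): merge vertically close blocks of the same category."""
    logical: list[LogicalBlock] = []
    for block in sorted(blocks, key=lambda b: b.rect[1]):
        if logical:
            last = logical[-1].members[-1]
            gap = block.rect[1] - (last.rect[1] + last.rect[3])
            if logical[-1].label == block.category and gap <= max_gap:
                logical[-1].members.append(block)
                continue
        logical.append(LogicalBlock(block.category, [block]))
    return logical

def bounding_rect(blocks: list[ContentBlock]) -> Rect:
    """Smallest rectangle that encloses every block in the group."""
    left = min(b.rect[0] for b in blocks)
    top = min(b.rect[1] for b in blocks)
    right = max(b.rect[0] + b.rect[2] for b in blocks)
    bottom = max(b.rect[1] + b.rect[3] for b in blocks)
    return (left, top, right - left, bottom - top)

def reconstruction(logical: list[LogicalBlock]) -> list[tuple[str, Rect]]:
    """Phase 3: flatten logical blocks into the final labelled segmentation."""
    return [(lb.label, bounding_rect(lb.members)) for lb in logical]

def segment(root: DomNode) -> list[tuple[str, Rect]]:
    return reconstruction(understanding(analysis(root)))

# Toy page: a link bar, a heading followed closely by a paragraph, and an image.
page = DomNode("body", (0, 0, 800, 600), [
    DomNode("a", (0, 0, 800, 40)),
    DomNode("h1", (0, 50, 800, 40)),
    DomNode("p", (0, 100, 800, 200)),
    DomNode("img", (0, 400, 800, 150)),
])
result = segment(page)
```

On this toy page the heading and paragraph fall into one text segment because they share a category and sit 10 pixels apart, which is below the assumed threshold. The link bar and the image each form their own segment.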
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-01092787
Contributor : Andrés Sanoja
Submitted on : Tuesday, December 9, 2014 - 2:43:00 PM
Last modification on : Thursday, March 21, 2019 - 12:59:20 PM

Citation

Andrés Sanoja, Stéphane Gançarski. Block-o-Matic: A web page segmentation framework. Multimedia Computing and Systems (ICMCS), 2014 International Conference on, Apr 2014, Marrakesh, Morocco. pp.595-600, ⟨10.1109/ICMCS.2014.6911249⟩. ⟨hal-01092787⟩
