Block-o-Matic: A web page segmentation framework

Andrés Sanoja 1, Stéphane Gançarski 1
1 BD - Bases de Données
LIP6 - Laboratoire d'Informatique de Paris 6
Abstract: This paper describes Block-o-Matic, a web page segmentation framework. It takes a hybrid approach inspired by automated document processing methods and visual-based content segmentation techniques. A web page is associated with three structures: the DOM tree, the content structure, and the logical structure. The DOM tree represents the HTML elements of a page. The content structure organizes page objects according to their content categories and geometry. The logical structure results from mapping the content structure according to the human-perceptible meaning that constitutes the blocks; it represents the final segmentation. The segmentation process is divided into three phases: analysis, understanding, and reconstruction of a web page. An evaluation method for web page segmentations is proposed, based on a ground truth of 400 pages classified into 16 categories. Block-o-Matic gives promising results.
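The three structures and three phases named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every class and function name, the tag-to-category table, the vertical-gap grouping rule, and the assignment of work to each phase are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DomNode:
    """One element of the DOM tree, with its rendered rectangle."""
    tag: str
    rect: tuple  # (x, y, width, height), hypothetical pixel geometry
    children: list = field(default_factory=list)


@dataclass
class ContentObject:
    """An entry of the content structure: a category plus geometry."""
    category: str
    rect: tuple


@dataclass
class Block:
    """An entry of the logical structure, i.e. one segment of the page."""
    label: str
    objects: list


# Hypothetical mapping from HTML tag to content category.
CATEGORY = {"h1": "heading", "p": "text", "img": "image", "a": "link"}


def analyze(root):
    """Analysis (assumed): collect leaf elements in document order."""
    objects, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(reversed(node.children))
        else:
            category = CATEGORY.get(node.tag, "other")
            objects.append(ContentObject(category, node.rect))
    return objects


def understand(objects, gap=20):
    """Understanding (assumed): group objects whose vertical distance
    to the previous object in the group is at most `gap` pixels."""
    groups = []
    for obj in sorted(objects, key=lambda o: o.rect[1]):
        if groups:
            last = groups[-1][-1]
            bottom = last.rect[1] + last.rect[3]
            if obj.rect[1] - bottom <= gap:
                groups[-1].append(obj)
                continue
        groups.append([obj])
    return groups


def reconstruct(groups):
    """Reconstruction (assumed): emit labeled blocks as the segmentation."""
    return [Block(f"B{i + 1}", group) for i, group in enumerate(groups)]


def segment(root):
    """Run the three phases in sequence on a DOM tree."""
    return reconstruct(understand(analyze(root)))
```

For example, a page with a heading directly above a paragraph and an image placed much further down would yield two blocks under this grouping rule. A real segmenter would rely on richer content categories and geometric cues than a single vertical gap.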
Document type: Conference paper
Multimedia Computing and Systems (ICMCS), 2014 International Conference on, Apr 2014, Marrakesh, Morocco. IEEE, pp.595-600, 2014, 〈10.1109/ICMCS.2014.6911249〉

https://hal.archives-ouvertes.fr/hal-01092787
Contributor: Andrés Sanoja
Submitted on: Tuesday, 9 December 2014 - 14:43:00
Last modified on: Thursday, 22 November 2018 - 14:11:44

Citation

Andrés Sanoja, Stéphane Gançarski. Block-o-Matic: A web page segmentation framework. Multimedia Computing and Systems (ICMCS), 2014 International Conference on, Apr 2014, Marrakesh, Morocco. IEEE, pp.595-600, 2014, 〈10.1109/ICMCS.2014.6911249〉. 〈hal-01092787〉
