Conference paper, Year: 2014

Block-o-Matic: A web page segmentation framework

Block-o-Matic: un framework pour la segmentation des pages Web

Andrés Sanoja
  • Role: Author
  • PersonId : 934855
Stéphane Gançarski

Abstract

In this paper we describe Block-o-Matic, a web page segmentation framework. It is a hybrid approach inspired by automated document processing methods and visual-based content segmentation techniques. A web page is associated with three structures: the DOM tree, the content structure and the logical structure. The DOM tree represents the HTML elements of a page; the content structure organizes page objects according to their content categories and geometry; and the logical structure is the result of mapping the content structure on the basis of the human-perceptible meaning that forms the blocks. The logical structure represents the final segmentation. The segmentation process is divided into three phases: analysis, understanding and reconstruction of a web page. We propose an evaluation method for web page segmentations, based on a ground truth of 400 pages classified into 16 categories. Block-o-Matic gives promising results.
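
The abstract names three structures and three phases but gives no algorithmic detail. As a rough illustration only, the Python sketch below shows one way such a pipeline could be laid out. Every name in it (`DomNode`, `ContentObject`, `LogicalBlock`, `CATEGORY_BY_TAG`, `segment`), the tag-to-category table, the rule of merging objects by category, and the assignment of work to each phase are assumptions made for this example; none of them is taken from the paper or from the Block-o-Matic code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed geometry convention: (x, y, width, height) in pixels.
Rect = Tuple[int, int, int, int]


@dataclass
class DomNode:
    """DOM tree node: one HTML element with its rendered geometry."""
    tag: str
    rect: Rect
    children: List["DomNode"] = field(default_factory=list)


@dataclass
class ContentObject:
    """Content structure entry: a page object with a category and geometry."""
    category: str
    rect: Rect


@dataclass
class LogicalBlock:
    """Logical structure entry: one labelled block of the final segmentation."""
    label: str
    rect: Rect


# Hypothetical tag-to-category table, invented for this sketch.
CATEGORY_BY_TAG: Dict[str, str] = {"img": "image", "p": "text", "a": "link"}


def analysis(node: DomNode) -> List[ContentObject]:
    """Assumed phase 1: turn leaf DOM elements into categorized objects."""
    if not node.children:
        category = CATEGORY_BY_TAG.get(node.tag, "other")
        return [ContentObject(category, node.rect)]
    objects: List[ContentObject] = []
    for child in node.children:
        objects.extend(analysis(child))
    return objects


def bounding_box(rects: List[Rect]) -> Rect:
    """Smallest rectangle that encloses every rectangle in `rects`."""
    left = min(r[0] for r in rects)
    top = min(r[1] for r in rects)
    right = max(r[0] + r[2] for r in rects)
    bottom = max(r[1] + r[3] for r in rects)
    return (left, top, right - left, bottom - top)


def understanding(objects: List[ContentObject]) -> List[LogicalBlock]:
    """Assumed phase 2: merge objects of the same category into one block."""
    by_category: Dict[str, List[Rect]] = {}
    for obj in objects:
        by_category.setdefault(obj.category, []).append(obj.rect)
    return [LogicalBlock(cat, bounding_box(rects))
            for cat, rects in by_category.items()]


def reconstruction(blocks: List[LogicalBlock]) -> List[LogicalBlock]:
    """Assumed phase 3: order blocks top-to-bottom, then left-to-right."""
    return sorted(blocks, key=lambda b: (b.rect[1], b.rect[0]))


def segment(root: DomNode) -> List[LogicalBlock]:
    """Run the three assumed phases in sequence."""
    return reconstruction(understanding(analysis(root)))


# Toy page: two paragraphs stacked above an image.
page = DomNode("body", (0, 0, 100, 100), [
    DomNode("p", (0, 0, 100, 20)),
    DomNode("p", (0, 20, 100, 30)),
    DomNode("img", (0, 60, 100, 40)),
])
blocks = segment(page)
```

On the toy page, the two paragraphs merge into a single text block and the image forms a second block beneath it. A real segmenter would rely on much richer geometric and semantic cues than a tag lookup; the sketch only makes the flow from DOM tree to content structure to logical structure concrete.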

Domains

Computer science
File not deposited

Dates and versions

hal-01092787 , version 1 (09-12-2014)

Identifiers

Cite

Andrés Sanoja, Stéphane Gançarski. Block-o-Matic: A web page segmentation framework. Multimedia Computing and Systems (ICMCS), 2014 International Conference on, Apr 2014, Marrakesh, Morocco. pp.595-600, ⟨10.1109/ICMCS.2014.6911249⟩. ⟨hal-01092787⟩