Web page segmentation evaluation - HAL open archive
Conference paper, Year: 2015

Web page segmentation evaluation

Andrés Sanoja
  • Role: Author
  • PersonId: 934855
Stéphane Gançarski

Abstract

In this paper, we present a framework for evaluating segmentation algorithms for Web pages. Web page segmentation consists of dividing a Web page into coherent fragments, called blocks. Each block represents one distinct information element in the page. We define an evaluation model that includes different metrics to evaluate the quality of a segmentation obtained with a given algorithm. These metrics compute the distance between the obtained segmentation and a manually built segmentation that serves as ground truth. We apply our framework to four state-of-the-art segmentation algorithms (BOM, Block Fusion, VIPS and JVIPS) on several categories (types) of Web pages. Results show that the tested algorithms usually perform rather well for text extraction but may have serious problems extracting geometry. They also show that the relative quality of a segmentation algorithm depends on the category of the segmented page.
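
The abstract does not spell out the metrics themselves, so the following is only an illustrative sketch of the general idea of comparing a computed segmentation with a ground-truth one on geometry. It assumes blocks are axis-aligned rectangles and uses intersection-over-union with a 0.5 threshold as the matching criterion; the names `Block`, `iou` and `count_matched`, and the threshold, are hypothetical choices made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rectangular page block (hypothetical representation)."""
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def overlap_area(a: Block, b: Block) -> float:
    """Area of the intersection of two rectangles (0 if disjoint)."""
    w = min(a.x + a.width, b.x + b.width) - max(a.x, b.x)
    h = min(a.y + a.height, b.y + b.height) - max(a.y, b.y)
    return max(w, 0.0) * max(h, 0.0)


def iou(a: Block, b: Block) -> float:
    """Intersection-over-union of two blocks, in [0, 1]."""
    inter = overlap_area(a, b)
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def count_matched(ground_truth, computed, threshold=0.5):
    """Greedily pair each ground-truth block with its best unused
    computed block; count pairs whose IoU reaches the threshold.
    The threshold value is an assumption, not a value from the paper."""
    unused = list(computed)
    matched = 0
    for gt in ground_truth:
        best = max(unused, key=lambda c: iou(gt, c), default=None)
        if best is not None and iou(gt, best) >= threshold:
            matched += 1
            unused.remove(best)
    return matched


# Manually built segmentation (two blocks) versus an algorithm's output
# (three blocks); the coordinates are made up for illustration.
truth = [Block(0, 0, 100, 50), Block(0, 50, 100, 150)]
found = [Block(0, 0, 100, 40), Block(0, 40, 100, 40), Block(0, 80, 100, 120)]
print(count_matched(truth, found), "of", len(truth), "ground-truth blocks matched")
```

Under these assumptions, the number of unmatched ground-truth blocks and the number of leftover computed blocks give a crude notion of distance between the two segmentations; the metrics defined in the paper may differ.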
No file deposited

Dates and versions

hal-01500681, version 1 (03-04-2017)

Identifiers

Cite

Andrés Sanoja, Stéphane Gançarski. Web page segmentation evaluation. 30th Annual ACM Symposium on Applied Computing, ACM, Apr 2015, Salamanca, Spain. ⟨10.1145/2695664.2695786⟩. ⟨hal-01500681⟩
97 views
0 downloads
