Web page segmentation evaluation

Andrés Sanoja 1 Stéphane Gançarski 1
1 BD - Bases de Données
LIP6 - Laboratoire d'Informatique de Paris 6
Abstract : In this paper, we present a framework for evaluating Web page segmentation algorithms. Web page segmentation consists of dividing a Web page into coherent fragments, called blocks, each of which represents one distinct information element in the page. We define an evaluation model that includes several metrics for assessing the quality of the segmentation produced by a given algorithm. These metrics compute the distance between the obtained segmentation and a manually built segmentation that serves as ground truth. We apply our framework to four state-of-the-art segmentation algorithms (BOM, Block Fusion, VIPS and JVIPS) on several categories (types) of Web pages. The results show that the tested algorithms usually perform rather well for text extraction but may have serious problems with the extraction of geometry. They also show that the relative quality of a segmentation algorithm depends on the category of the segmented page.
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-01500681
Contributor : Stéphane Gançarski
Submitted on : Monday, April 3, 2017 - 3:33:54 PM
Last modification on : Friday, January 8, 2021 - 5:32:09 PM

Citation

Andrés Sanoja, Stéphane Gançarski. Web page segmentation evaluation. 30th Annual ACM Symposium on Applied Computing, ACM, Apr 2015, Salamanca, Spain. ⟨10.1145/2695664.2695786⟩. ⟨hal-01500681⟩