Conference papers

Web Page Segmentation for Non Visual Skimming

Judith Jeyafreeda Andrew, Stéphane Ferrari 1, Fabrice Maurel 1, Gaël Dias 1, Emmanuel Giguet 1
1 Equipe Hultech - Laboratoire GREYC - UMR6072
GREYC - Groupe de Recherche en Informatique, Image et Instrumentation de Caen
Abstract : Web page segmentation aims to break a page into smaller blocks in which contents with coherent semantics are kept together. Typical tasks targeted by this technique include advertisement detection and main content extraction. In this paper, we study different segmentation strategies for the task of non visual skimming. For that purpose, we treat web page segmentation as a clustering problem over visual elements, where (1) all visual elements must be clustered, (2) a fixed number of clusters must be discovered, and (3) the elements of a cluster should be visually connected. We therefore study three algorithms that comply with these constraints: K-means, F-K-means, and Guided Expansion. The evaluation shows that Guided Expansion yields statistically relevant results in terms of compactness and separateness, and satisfies more logical constraints than the other strategies.
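To make the clustering formulation in the abstract concrete, here is a minimal sketch of plain K-means (Lloyd's algorithm) applied to visual elements. It is not the authors' implementation. The function name `kmeans_segment`, the choice of representing each visual element by the (x, y) centre of its bounding box, and the random initialisation are all assumptions made for illustration. The sketch assigns every element to a zone and asks for a fixed number of zones `k`, but it does not enforce that the elements of a zone are visually connected, and it does not cover F-K-means or Guided Expansion, whose details are not given on this page.

```python
import math
import random


def kmeans_segment(centers, k, iterations=50, seed=0):
    """Group element centre points (x, y) into k zones with plain K-means.

    Hypothetical helper: returns one zone label (0..k-1) per element.
    """
    if not 1 <= k <= len(centers):
        raise ValueError("k must be between 1 and the number of elements")
    rng = random.Random(seed)
    # Assumption: initial centroids are k elements picked at random.
    centroids = rng.sample(centers, k)
    labels = [0] * len(centers)
    for _ in range(iterations):
        # Assignment step: every element joins its nearest centroid,
        # so no element is left unclustered.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in centers
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(centers, new_labels) if lab == c]
            if members:  # an empty zone keeps its previous centroid
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels


# Six hypothetical element centres forming two distant groups on a page.
elements = [(0, 0), (1, 0), (0, 1), (100, 100), (101, 100), (100, 101)]
zones = kmeans_segment(elements, k=2)
```

With this toy input the two distant groups end up in different zones. On real pages, distance between centres alone may merge or split blocks that are not visually connected, which is why constraint (3) needs separate handling.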

https://hal.archives-ouvertes.fr/hal-02309625
Contributor : Giguet Emmanuel
Submitted on : Wednesday, October 9, 2019 - 2:13:45 PM
Last modification on : Saturday, June 25, 2022 - 9:54:01 AM

File

PACLIC_33_2019-Web_Page_Segmen...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02309625, version 1

Citation

Judith Jeyafreeda Andrew, Stéphane Ferrari, Fabrice Maurel, Gaël Dias, Emmanuel Giguet. Web Page Segmentation for Non Visual Skimming. The 33rd Pacific Asia Conference on Language, Information and Computation (PACLIC 33), Sep 2019, Hakodate, Japan. ⟨hal-02309625⟩

Metrics

Record views : 166
Files downloads : 226