Conference papers

Model-driven Web Page Segmentation for Non Visual Access

Judith Jeyafreeda Andrew, Stéphane Ferrari 1, Fabrice Maurel 1, Gaël Dias 1, Emmanuel Giguet 1
1 Equipe Hultech - Laboratoire GREYC - UMR6072
GREYC - Groupe de Recherche en Informatique, Image et Instrumentation de Caen
Abstract : Web page segmentation aims to break a large page into smaller blocks in which contents with coherent semantics are kept together. Within this context, many approaches have been proposed without any specific end task in mind. In this paper, we study different segmentation strategies for the task of non visual skimming. For that purpose, we propose to segment web pages into visually coherent zones so that each zone can be represented by a set of relevant keywords that can be further synthesized into concurrent speech. As a consequence, we consider web page segmentation as a clustering problem over visual elements, where (1) a fixed number of clusters must be discovered, (2) the elements of a cluster should be visually connected, and (3) all visual elements must be clustered. We therefore study variations of three existing algorithms that comply with these constraints: K-means, F-K-means, and Guided Expansion. In particular, we evaluate different reading strategies for positioning the initial K seeds, as well as a pre-clustering methodology for the Guided Expansion algorithm, whose goal is to (1) speed up the clustering process and (2) reduce imbalance between clusters. The evaluation shows that the Guided Expansion algorithm yields statistically significant improvements over the other two algorithms across the variations of the reading strategies. Nevertheless, further improvements are needed to increase separateness.
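To make the clustering formulation in the abstract concrete, the following is a minimal sketch of plain K-means over the centre points of visual elements, with the K initial seeds placed along a top-left to bottom-right diagonal. The diagonal placement is only one hypothetical reading strategy, and the names `diagonal_seeds` and `kmeans` as well as the page dimensions are illustrative assumptions, not the authors' implementation. The sketch satisfies constraints (1) and (3) of the abstract (fixed K, every element assigned) but does not enforce constraint (2), visual connectivity, and it does not cover F-K-means or Guided Expansion.

```python
from math import dist  # Euclidean distance, Python 3.8+


def diagonal_seeds(width, height, k):
    """Place k seeds evenly along the top-left to bottom-right diagonal.

    Hypothetical reading strategy, chosen here for illustration only.
    """
    return [((i + 0.5) * width / k, (i + 0.5) * height / k) for i in range(k)]


def kmeans(points, seeds, max_iter=50):
    """Plain K-means on 2D points; returns (labels, centroids).

    Every point receives a label, and the number of clusters is fixed
    by the number of seeds. Visual connectivity is NOT enforced.
    """
    centroids = list(seeds)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each element joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments are stable
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids


# Toy page of 900 x 900 pixels with six element centres (made-up data).
elements = [(100, 100), (120, 140), (440, 460), (470, 430), (760, 780), (800, 740)]
labels, centroids = kmeans(elements, diagonal_seeds(900, 900, 3))
```

In this sketch the seed placement is the only part that depends on the reading strategy, so a different strategy could be tried by swapping `diagonal_seeds` for another seed generator without touching the clustering loop.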

https://hal.archives-ouvertes.fr/hal-02309612
Contributor : Giguet Emmanuel
Submitted on : Wednesday, October 9, 2019 - 2:07:16 PM
Last modification on : Saturday, June 25, 2022 - 9:54:01 AM

File

PACLING_2019_Model-Driven-Web-...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02309612, version 1

Citation

Judith Jeyafreeda Andrew, Stéphane Ferrari, Fabrice Maurel, Gaël Dias, Emmanuel Giguet. Model-driven Web Page Segmentation for Non Visual Access. 16th International Conference of the Pacific Association for Computational Linguistics (PACLING 2019), Oct 2019, Hanoï City, Vietnam. ⟨hal-02309612⟩
