Conference paper. Year: 2019

Model-driven Web Page Segmentation for Non Visual Access

Judith Jeyafreeda Andrew
  • Role: Author
Fabrice Maurel
Gaël Dias

Abstract

Web page segmentation aims to break a large page into smaller blocks in which contents with coherent semantics are kept together. Within this context, many approaches have been proposed without any specific end task in mind. In this paper, we study different segmentation strategies for the task of non-visual skimming. For that purpose, we propose to segment web pages into visually coherent zones so that each zone can be represented by a set of relevant keywords that can be further synthesized into concurrent speech. As a consequence, we consider web page segmentation as a clustering problem of visual elements, where (1) a fixed number of clusters must be discovered, (2) the elements of a cluster should be visually connected, and (3) all visual elements must be clustered. Therefore, we study variations of three existing algorithms that comply with these constraints: K-means, F-K-means, and Guided Expansion. In particular, we evaluate different reading strategies for the positioning of the initial K seeds, as well as a pre-clustering methodology for the Guided Expansion algorithm, whose goal is to (1) speed up the clustering process and (2) reduce imbalance between clusters. The evaluation shows that the Guided Expansion algorithm yields statistically significant improvements over the two other algorithms across the variations of the reading strategies. Nevertheless, further improvements are still needed to increase separateness.
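
The clustering formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Box` representation, the `diagonal_seeds` layout (one possible "reading strategy" for placing the initial K seeds), and the function names are all assumptions made for illustration. The sketch runs plain K-means on the centers of visual elements, so it covers constraints (1) and (3) but does not enforce the visual connectivity of constraint (2).

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Bounding box of one visual element (hypothetical representation)."""
    x: float
    y: float
    w: float
    h: float

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def diagonal_seeds(page_w, page_h, k):
    """Place k seeds evenly along the top-left to bottom-right diagonal.

    This layout is an assumed example of a reading strategy, not one
    taken from the paper.
    """
    return [(page_w * (i + 0.5) / k, page_h * (i + 0.5) / k) for i in range(k)]


def kmeans_zones(boxes, seeds, max_iter=50):
    """Plain K-means on box centers; returns one zone label per box."""
    centroids = list(seeds)
    labels = None
    for _ in range(max_iter):
        # Assignment step: every element goes to its nearest centroid,
        # so all visual elements end up clustered.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: hypot(b.center[0] - centroids[j][0],
                                    b.center[1] - centroids[j][1]))
            for b in boxes
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [b.center for b, l in zip(boxes, labels) if l == j]
            if members:  # keep the old centroid if a zone is empty
                centroids[j] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels


# Toy page (100 x 500) with two elements near the top and two near the bottom.
boxes = [Box(0, 0, 100, 20), Box(0, 30, 100, 20),
         Box(0, 400, 100, 20), Box(0, 430, 100, 20)]
labels = kmeans_zones(boxes, diagonal_seeds(100, 500, 2))
print(labels)  # [0, 0, 1, 1]
```

Because the number of seeds fixes K, changing the seed layout changes only the starting point of the clustering, which is why the choice of reading strategy can affect the resulting zones.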
Main file
PACLING_2019_Model-Driven-Web-Page-Segmentation.pdf (385 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02309612, version 1 (09-10-2019)

Identifiers

  • HAL Id: hal-02309612, version 1

Cite

Judith Jeyafreeda Andrew, Stéphane Ferrari, Fabrice Maurel, Gaël Dias, Emmanuel Giguet. Model-driven Web Page Segmentation for Non Visual Access. 16th International Conference of the Pacific Association for Computational Linguistics (PACLING 2019), Oct 2019, Hanoï City, Vietnam. ⟨hal-02309612⟩
128 views
196 downloads
