From Layout to Semantic: A Reranking Model for Mapping Web Documents to Mediated XML Representations
Conference paper, Year: 2007

Abstract

Many documents on the Web are written in weakly structured formats. Because their semantics are weak and their formats heterogeneous, the information conveyed by their structure cannot be exploited directly. We consider here the conversion of such documents into a predefined, mediated, semi-structured format that is more amenable to automatic processing of the document content. We develop a machine learning approach to this conversion problem in which the transformation is learned automatically from a set of example documents that have been manually transformed into the target structure. Our method proceeds in three steps. Given an input document, document elements are first annotated with labels of the target schema. Structured candidate documents are then generated using a generalized probabilistic context-free parsing algorithm. Finally, candidates are reranked using a perceptron-like ranking algorithm. Experiments performed on two different datasets show that the proposed method performs well in different contexts.
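The third step, reranking candidates with a perceptron-like algorithm, can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names (`parser_logprob`, `bad_nesting`), the toy data, and the plain structured-perceptron update are all assumptions chosen for illustration. Each candidate document is represented only by a feature dictionary, and the gold candidate is assumed to be known for every training example.

```python
from typing import Dict, List, Sequence, Tuple

Features = Dict[str, float]
Example = Tuple[Sequence[Features], int]  # (candidate feature dicts, gold index)


def score(weights: Features, feats: Features) -> float:
    """Linear score of one candidate: dot product of weights and features."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def rerank(weights: Features, candidates: Sequence[Features]) -> int:
    """Return the index of the highest-scoring candidate (first one on ties)."""
    return max(range(len(candidates)), key=lambda i: score(weights, candidates[i]))


def train_perceptron(examples: List[Example], epochs: int = 5) -> Features:
    """Perceptron-style training: on a mistake, move the weights toward the
    gold candidate's features and away from the wrongly top-ranked one."""
    weights: Features = {}
    for _ in range(epochs):
        for candidates, gold in examples:
            predicted = rerank(weights, candidates)
            if predicted == gold:
                continue
            for name, value in candidates[gold].items():
                weights[name] = weights.get(name, 0.0) + value
            for name, value in candidates[predicted].items():
                weights[name] = weights.get(name, 0.0) - value
    return weights


# Hypothetical toy data: two input documents, two candidate structures each.
toy_examples: List[Example] = [
    ([{"parser_logprob": -1.0, "bad_nesting": 1.0},
      {"parser_logprob": -2.0, "bad_nesting": 0.0}], 1),
    ([{"parser_logprob": -1.0, "bad_nesting": 0.0},
      {"parser_logprob": -0.5, "bad_nesting": 1.0}], 0),
]

toy_weights = train_perceptron(toy_examples)
```

In this sketch, the candidates would come from the parsing step and the features would describe each candidate structure as a whole, which is what lets a reranker use information that a context-free parser cannot score locally.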
No file deposited

Dates and versions

hal-01336141, version 1 (22-06-2016)

Identifiers

  • HAL Id: hal-01336141, version 1

Cite

Guillaume Wisniewski, Patrick Gallinari. From Layout to Semantic: A Reranking Model for Mapping Web Documents to Mediated XML Representations. RIAO International Conference on Large-Scale Semantic Access to Content, May 2007, Pittsburgh, United States. pp.433-448. ⟨hal-01336141⟩