From Layout to Semantic: A Reranking Model for Mapping Web Documents to Mediated XML Representations

Guillaume Wisniewski 1 Patrick Gallinari 1
1 MALIRE - Machine Learning and Information Retrieval
LIP6 - Laboratoire d'Informatique de Paris 6
Abstract : Many documents on the Web are formatted in a weakly structured format. Because their semantics are weak and their formats heterogeneous, the information conveyed by their structure cannot be directly exploited. We consider here the conversion of such documents into a predefined mediated semi-structured format that is more amenable to automatic processing of the document content. We develop a machine learning approach to this conversion problem, in which the transformation is learned automatically from a set of example documents manually transformed into the target structure. Our method proceeds in three steps. Given an input document, document elements are first annotated with labels of the target schema. Structured candidate documents are then generated using a generalized probabilistic context-free parsing algorithm. Finally, candidates are reranked using a perceptron-like ranking algorithm. Experiments performed on two different datasets show that the proposed method performs well in different contexts.
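The third step of the abstract (reranking candidates with a perceptron-like algorithm) can be illustrated with a minimal sketch. It is not the authors' implementation: the feature map (counts of parent>child label pairs), the tuple encoding of trees, and all function names are assumptions made for illustration, and the update shown is the plain perceptron rule. The first two steps (annotation and probabilistic parsing) are not modeled; the candidate trees are simply given.

```python
from collections import Counter

# A candidate tree is (label, children); children are sub-trees or text strings.
# This encoding and the feature map below are hypothetical, not from the paper.

def features(tree):
    """Count parent>child label pairs over the whole tree."""
    label, children = tree
    feats = Counter()
    for child in children:
        if isinstance(child, tuple):
            feats[f"{label}>{child[0]}"] += 1
            feats.update(features(child))
        else:
            feats[f"{label}>#text"] += 1
    return feats

def score(weights, feats):
    """Linear score: dot product of the weight vector and the feature counts."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def rerank(weights, candidates):
    """Return the highest-scoring candidate (first one wins on ties)."""
    return max(candidates, key=lambda c: score(weights, features(c)))

def train(examples, epochs=5):
    """examples: list of (candidates, index of the correct candidate)."""
    weights = {}
    for _ in range(epochs):
        for candidates, gold_index in examples:
            gold = candidates[gold_index]
            predicted = rerank(weights, candidates)
            if predicted != gold:
                # Perceptron update: move toward the gold tree's features
                # and away from the wrongly preferred tree's features.
                for name, value in features(gold).items():
                    weights[name] = weights.get(name, 0.0) + value
                for name, value in features(predicted).items():
                    weights[name] = weights.get(name, 0.0) - value
    return weights

# Toy data: two candidate structures for the same two text fragments.
cand_a = ("movie", (("title", ("Alien",)), ("year", ("1979",))))
cand_b = ("movie", (("title", ("Alien",)), ("title", ("1979",))))

weights = train([([cand_b, cand_a], 1)])
best = rerank(weights, [cand_b, cand_a])
```

In this toy run the untrained model prefers `cand_b` only because it is listed first; after one update the weights penalize the repeated `title` element, so `best` is `cand_a`.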
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-01336141
Contributor : Lip6 Publications
Submitted on : Wednesday, June 22, 2016 - 4:21:25 PM
Last modification on : Thursday, March 21, 2019 - 2:43:07 PM

Identifiers

  • HAL Id : hal-01336141, version 1

Citation

Guillaume Wisniewski, Patrick Gallinari. From Layout to Semantic: A Reranking Model for Mapping Web Documents to Mediated XML Representations. RIAO International Conference on Large-Scale Semantic Access to Content, May 2007, Pittsburgh, United States. pp.433-448. ⟨hal-01336141⟩
