Probabilistic Model for Structured Document Mapping

Abstract : We address the problem of automatically learning to map heterogeneous semi-structured documents onto a mediated target XML schema. We adopt a machine learning approach in which the mapping between input and target documents is learned from a training corpus of documents. We first introduce a general stochastic model of semi-structured document generation and transformation. This model relies on the concept of a meta-document, a latent variable that provides a link between input and target documents. It allows us to learn the correspondences when the input documents are expressed in a large variety of schemas. We then detail an instance of the general model for the particular task of HTML to XML conversion. This instance is tested on three different corpora using two different inference methods: a dynamic programming method and an approximate LaSO-based method.
Document type :
Conference papers
Contributor : Lip6 Publications
Submitted on : Wednesday, June 22, 2016 - 4:23:18 PM
Last modification on : Thursday, March 21, 2019 - 2:43:09 PM




Guillaume Wisniewski, Francis Maes, Ludovic Denoyer, Patrick Gallinari. Probabilistic Model for Structured Document Mapping. 5th International Conference on Machine Learning and Data Mining for Pattern Recognition (MLDM'07), Jul 2007, Leipzig, Germany. pp.854-867, ⟨10.1007/978-3-540-73499-4_64⟩. ⟨hal-01336148⟩


