Corpus-Based Structure Mapping of XML Document Corpora: A Reinforcement Learning Based Model

Abstract : We address the problem of automatically learning to map flat and semi-structured documents onto a mediated target XML schema. This problem is motivated by the recent development of applications for searching and mining semi-structured document sources and corpora. Academic research has mainly dealt with homogeneous collections. In practical applications, however, data come from multiple heterogeneous sources, and mining such collections requires defining a mapping, or correspondence, between the different document formats. Automating the design of such mappings has rapidly become a key issue for these applications. We propose a machine learning approach to this problem in which the mapping is learned from pairs of input documents and their corresponding target documents, provided by a user. The mapping process is formalized as a Markov Decision Process, and training is performed with Reinforcement Learning, a classical machine learning framework. The resulting model can cope with complex mappings while keeping linear complexity. We describe a set of experiments on several corpora representative of different mapping tasks and show that the method learns mappings with high accuracy across these corpora.
Document type : Book sections
Contributor : Lip6 Publications
Submitted on : Wednesday, April 20, 2016 - 4:11:36 PM
Last modification on : Thursday, March 21, 2019 - 1:19:38 PM

Francis Maes, Ludovic Denoyer, Patrick Gallinari. Corpus-Based Structure Mapping of XML Document Corpora: A Reinforcement Learning Based Model. Modeling, Learning, and Processing of Text Technological Data Structures, 370, Springer Berlin/Heidelberg, pp.249-266, 2012, Studies in Computational Intelligence, 978-3-642-22612-0. ⟨10.1007/978-3-642-22613-7_13⟩. ⟨hal-01305048⟩


