Stochastic models for document restructuration

Abstract : Document (re)structuration consists of mapping documents that come from different sources, in different formats, onto a predefined semi-structured format. This generic problem arises in several application settings, such as querying heterogeneous semi-structured databases, peer-to-peer systems, legacy document conversion, and XML information retrieval. In this paper, we define the restructuration problem from a document-centric perspective and identify the main issues raised by this new problem. We then consider two instances of restructuration: structuring flat documents and learning the correspondence between structured formats. We propose stochastic models for these two tasks and describe tests on a large XML document collection.
Contributor : Ludovic Denoyer
Submitted on : Tuesday, August 30, 2016 - 10:17:29 AM
Last modification on : Thursday, March 21, 2019 - 2:18:56 PM


  • HAL Id : hal-01357589, version 1


Patrick Gallinari, Guillaume Wisniewski, Francis Maes, Ludovic Denoyer. Stochastic models for document restructuration. ECML'05 Workshop on Relational Machine Learning, Oct 2005, Porto, Portugal. ⟨hal-01357589⟩



