Restructuring Unstructured Documents: On the use of smart and semi-automatic interfaces to structure unstructured data

Abstract : Every day, the volume of the world's digital data increases considerably. Over 75% of these data are unstructured. This paper addresses the restructuring of graphic information contained in Portable Document Format (PDF) files and/or vector files. These documents are generally held by "Smart Factory" services: design offices, methods departments, new work departments and company maintenance services. To restructure these data, we propose using Knowledge Discovery in Databases (KDD) methods. Although the user is theoretically present throughout the KDD process, in practice this is not the case, as Fayard observed in 2003 at the KDD conference: generally, the user is only present during the validation phase. We show why, in data restructuring, the user must be at the heart of the process and present at all stages. We refer to this as (A)KDD, for Anthropocentric Knowledge Discovery in Databases. The first stage of this restructuring consists of extracting the graphic and text objects contained in PDF files and putting them into a pivot data format. The second stage consists of coding this information in the form of an alphabet. The third stage consists of recreating the graphic and text components that are repeated in these files (which we refer to as graphemes). The fourth stage consists either (1) of automatically identifying these graphemes based on existing knowledge or (2) of presenting them to the user, who identifies them and introduces them into the knowledge base. It is this entire restructuring process that we describe in this paper. As highlighted above, in this incremental process it is people who play the main role, assisted by computers, and not the opposite.
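As a rough illustration of stages two to four described in the abstract, the following minimal sketch assumes that the pivot format is simply a list of (kind, width, height) tuples. The pivot format, the alphabet (H, V, T, X), the function names and the example label are all hypothetical and are not taken from the paper; stage one (PDF extraction) is omitted.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Assumed pivot format (hypothetical): one (kind, width, height) tuple per object.
Primitive = Tuple[str, float, float]


def encode(primitives: List[Primitive]) -> str:
    """Stage 2: map each primitive to one letter of an assumed alphabet."""
    symbols = []
    for kind, width, height in primitives:
        if kind == "text":
            symbols.append("T")
        elif kind == "line":
            # Mostly horizontal strokes become H, mostly vertical ones V.
            symbols.append("H" if abs(width) >= abs(height) else "V")
        else:
            symbols.append("X")  # any other graphic object
    return "".join(symbols)


def repeated_sequences(word: str, length: int, min_count: int = 2) -> Dict[str, int]:
    """Stage 3: candidate graphemes are substrings that occur repeatedly.

    Overlapping occurrences are counted.
    """
    counts = Counter(word[i:i + length] for i in range(len(word) - length + 1))
    return {seq: n for seq, n in counts.items() if n >= min_count}


def identify(
    candidates: Iterable[str],
    knowledge_base: Dict[str, str],
    ask_user: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Stage 4: use known labels, otherwise ask the user and store the answer."""
    labels = {}
    for seq in candidates:
        if seq not in knowledge_base:
            answer = ask_user(seq)
            if answer is None:
                continue  # the user declined to name this candidate
            knowledge_base[seq] = answer
        labels[seq] = knowledge_base[seq]
    return labels
```

In this sketch the user supplies a label only the first time a candidate appears; later runs reuse the knowledge base, which mirrors the incremental, user-centred loop that the abstract describes.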

Cited literature : 24 references
Contributor : Nadine Couture
Submitted on : Saturday, January 6, 2018 - 7:55:34 PM
Last modification on : Wednesday, July 3, 2019 - 11:18:02 AM


Publisher files allowed on an open archive




  • HAL Id : hal-01653656, version 2



Jacques Péré-Laperne, Nadine Couture. Restructuring Unstructured Documents: On the use of smart and semi-automatic interfaces to structure unstructured data. SMART INTERFACES 2017, The Symposium for Empowering and Smart Interfaces in Engineering , Jun 2017, Venice, Italy. pp.60-65. ⟨hal-01653656v2⟩


