Journal article. International Journal of Competitive Intelligence, Strategic, Scientific and Technology Watch (SCI&WATCH). Year: 2009

Multistructured documents: from modelling to multidimensional analyses

Abstract

With the recent development of new information and communication technologies, paper documents are being transformed into digital documents. Furthermore, a document is no longer seen as a whole, or as a monolithic block, but as a set of organized entities. Exploiting these documents amounts to identifying and locating these entities. The entities are connected by relationships that give a "form" to the document. Several types of relationships may occur, so that several "forms" of a document emerge. These different materializations of the same document correspond to different uses of it and are essential for the optimal management and sharing of holdings. The work presented in this thesis aims to address the challenges of representing the different materializations of a document through the representation of its entities and their relationships. When those materializations are expressed as structures, the issues become those of representing multistructured documents. Our work focuses mainly on the modeling, integration and exploitation of multistructured documents:
(1) Proposal of a multistructured document model. This model incorporates two levels of description: a specific level, which describes each document through the entities that compose it, and a generic level, which identifies kinds of documents by grouping similar structures.
(2) Proposal of techniques for extracting the structure (implicit or explicit) of a document (the specific level) and for classifying this structure with respect to common structures (the generic level). The proposed classification algorithm includes the calculation of a so-called "structural" distance (comparison of trees and graphs). This classification is associated with a process that verifies the "cohesion" of classes and, where needed, reorganizes disrupted classes.
(3) Proposal of techniques for exploiting documents through their structures and their contents: (a) a document search that can return documentary granules using criteria based on structures and/or content, and (b) a multidimensional analysis that analyzes and visualizes documentary information across multiple dimensions (of structures and/or content).
In order to validate our proposals, we have developed a tool for the integration and analysis of multistructured documents, called MDOCREP (Multistructured Document Repository). This tool provides, on the one hand, the extraction and classification of document structures and, on the other hand, the querying and multidimensional analysis of documents through their different structures.
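
The two levels of description and the distance-based classification of point (2) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `Node`, `paths`, `structural_distance` and `classify`, the 0.5 threshold, and above all the distance itself (a Jaccard distance over root-to-node label paths), which is a simple stand-in and not the structural distance defined by the authors or implemented in MDOCREP.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entity of a document's specific structure (hypothetical name)."""
    label: str
    children: list = field(default_factory=list)  # child Node objects


def paths(node, prefix=()):
    """Return the set of root-to-node label paths of a tree."""
    here = prefix + (node.label,)
    result = {here}
    for child in node.children:
        result |= paths(child, here)
    return result


def structural_distance(a, b):
    """Jaccard distance between the label-path sets of two trees.

    Assumption: an illustrative stand-in, not the authors' distance.
    """
    pa, pb = paths(a), paths(b)
    return 1.0 - len(pa & pb) / len(pa | pb)


def classify(structure, generic_structures, threshold=0.5):
    """Attach a specific structure to the closest generic structure.

    Returns the name of the closest class, or None when every class is
    farther away than `threshold` (a new class would then be needed).
    """
    best_name, best_dist = None, threshold
    for name, representative in generic_structures.items():
        d = structural_distance(structure, representative)
        if d <= best_dist:
            best_name, best_dist = name, d
    return best_name


# Generic level: one representative tree per kind of document.
article = Node("article", [Node("title"), Node("section", [Node("paragraph")])])
letter = Node("letter", [Node("sender"), Node("body", [Node("paragraph")])])
generic = {"article": article, "letter": letter}

# Specific level: the structure extracted from one particular document.
doc = Node("article", [Node("title"),
                       Node("section", [Node("paragraph"), Node("figure")])])

print(classify(doc, generic))  # article
```

Here the specific structure shares four of its five label paths with the "article" representative and none with the "letter" one, so it is attached to the "article" class. A structure that is far from every representative returns `None`, which is where a cohesion check or a reorganization of classes, as mentioned in the abstract, would come into play.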
File not deposited

Dates and versions

hal-03614928, version 1 (21-03-2022)

Identifiers

  • HAL Id: hal-03614928, version 1

Cite

Karim Djemal, Chantal Soulé-Dupuy, Nathalie Vallès-Parlangeau. Multistructured documents: from modelling to multidimensional analyses. International Journal of Competitive Intelligence, Strategic, Scientific and Technology Watch (SCI&WATCH), 2009. ⟨hal-03614928⟩