Theses

Extraction d’information dans des documents manuscrits anciens (Information extraction from ancient handwritten documents)

Adeline Granet 1, 2, 3
2 IPI - Image Perception Interaction
LS2N - Laboratoire des Sciences du Numérique de Nantes
3 TALN - Traitement Automatique du Langage Naturel
LS2N - Laboratoire des Sciences du Numérique de Nantes
Abstract : Exploring newly digitized but still unexploited resources to find relevant information is a complicated task because of the sheer volume of available material. Thanks to the ANR project CIRESFI, the most important resource for the Italian Comedy of the 18th century is a set of accounting registers totaling 28,000 pages. Information retrieval is a long and complex process that requires expertise at every step: detection and segmentation into paragraphs, lines, or words; feature extraction; and handwriting recognition. Systems based on deep neural networks dominate these approaches. Their major drawback is the large amount of data needed to train them. However, the registers of the Italian Comedy have no ground truth. To overcome this lack of data, we explore approaches involving transfer learning: the systems are trained on heterogeneous, labeled, and available data that share at least one feature with our data, and are then applied to our data. All of our experiments have shown how difficult this task is, since each choice at each stage has a strong impact on the rest of the system. We converge on a solution that separates the optical model from the language model, so that each can be trained independently on different available resources, and that joins them by projecting the information into a non-latent common space.

Cited literature [259 references]

https://hal.archives-ouvertes.fr/tel-02925118
Contributor : Harold Mouchère
Submitted on : Friday, August 28, 2020 - 4:55:22 PM
Last modification on : Tuesday, January 5, 2021 - 4:26:09 PM

File

GRANET.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : tel-02925118, version 1

Citation

Adeline Granet. Extraction d’information dans des documents manuscrits anciens. Traitement des images [eess.IV]. Université de Nantes, 2018. Français. ⟨tel-02925118⟩

Metrics

Record views : 154
File downloads : 250