Detection, localization and typing of text in heterogeneous document images with Deep Neural Networks

Abstract : Automatically reading the text in printed and handwritten documents gives access to the information they convey. For full-page text transcription, detecting and localizing the text lines is a crucial step. Traditional methods tend to rely on image processing, but they generalize poorly to very heterogeneous datasets. In this thesis, we propose an approach based on deep neural networks. We first propose a one-dimensional segmentation of text paragraphs into lines, using a technique inspired by text recognition models: connectionist temporal classification (CTC) is used to implicitly align the sequences. We then propose a neural network that directly predicts the coordinates of the boxes bounding the text lines. Adding a confidence prediction to these hypothesis boxes makes it possible to locate a varying number of objects. We predict the objects locally in order to share the network parameters between locations and to increase the number of different objects that each box predictor sees during training, which compensates for the rather small size of the available datasets. To recover the contextual information that carries knowledge of the document layout, we add multi-dimensional LSTM recurrent layers between the convolutional layers of our networks. We propose three full-page text recognition strategies that address the need for highly precise text line position predictions. On the heterogeneous Maurdor dataset, we show how our methods perform on printed and handwritten documents in French, English, and Arabic, and we compare favourably with other state-of-the-art methods. Visualizing the concepts learned by our neurons highlights the ability of the recurrent layers to convey contextual information.
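
The abstract notes that attaching a confidence to each hypothesis box lets the system locate a varying number of text lines. The following is only a minimal sketch of that idea, not the thesis implementation: the `Hypothesis` type, the `select_boxes` function, the 0.5 threshold, and the sample values are all hypothetical and chosen for illustration.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    """One box hypothesis (hypothetical layout: pixel coordinates plus a score)."""
    left: float
    top: float
    right: float
    bottom: float
    confidence: float


def select_boxes(hypotheses: List[Hypothesis], threshold: float = 0.5) -> List[Hypothesis]:
    """Keep hypotheses scoring at least `threshold`, ordered from top to bottom.

    The threshold value is an assumption; the number of boxes returned
    depends on the scores, so it can differ from page to page.
    """
    kept = [h for h in hypotheses if h.confidence >= threshold]
    return sorted(kept, key=lambda h: h.top)


# Made-up scores standing in for the output of a box-predicting network.
hypotheses = [
    Hypothesis(10, 120, 300, 150, 0.91),
    Hypothesis(12, 40, 310, 70, 0.88),
    Hypothesis(200, 400, 260, 420, 0.07),
]
lines = select_boxes(hypotheses)
```

Because the predictor always emits the same number of hypotheses, the confidence score is what turns a fixed-size output into a variable-length list of detected lines.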

Contributor : Christian Wolf
Submitted on : Friday, November 23, 2018 - 2:05:19 PM
Last modification on : Friday, May 17, 2019 - 10:31:58 AM



  • HAL Id : tel-01932920, version 1


Bastien Moysset. Détection, localisation et typage de texte dans des images de documents hétérogènes par Réseaux de Neurones Profonds. Traitement du texte et du document. Université de Lyon, 2018. Français. ⟨NNT : 2018LYSEI044⟩. ⟨tel-01932920⟩


