Skip to Main content Skip to Navigation

Automatic reconstruction of itineraries from descriptive texts

Abstract : This PhD thesis is part of the research project PERDIDO, which aims at extracting and retrieving displacements from textual documents. This work was conducted in collaboration with the LIUPPA laboratory of the university of Pau (France), the IAAA team of the university of Zaragoza (Spain) and the COGIT laboratory of IGN (France). The objective of this PhD is to propose a method for establishing a processing chain to support the geoparsing and geocoding of text documents describing events strongly linked with space. We propose an approach for the automatic geocoding of itineraries described in natural language. Our proposal is divided into two main tasks. The first task aims at identifying and extracting information describing the itinerary in texts such as spatial named entities and expressions of displacement or perception. The second task deal with the reconstruction of the itinerary. Our proposal combines local information extracted using natural language processing and physical features extracted from external geographical sources such as gazetteers or datasets providing digital elevation models. The geoparsing part is a Natural Language Processing approach which combines the use of part of speech and syntactico-semantic combined patterns (cascade of transducers) for the annotation of spatial named entities and expressions of displacement or perception. The main contribution in the first task of our approach is the toponym disambiguation which represents an important issue in Geographical Information Retrieval (GIR). We propose an unsupervised geocoding algorithm that takes profit of clustering techniques to provide a solution for disambiguating the toponyms found in gazetteers, and at the same time estimating the spatial footprint of those other fine-grain toponyms not found in gazetteers. We propose a generic graph-based model for the automatic reconstruction of itineraries from texts, where each vertex represents a location and each edge represents a path between locations. %, combining information extracted from texts and information extracted from geographical databases. Our model is original in that in addition to taking into account the classic elements (paths and waypoints), it allows to represent the other elements describing an itinerary, such as features seen or mentioned as landmarks. To build automatically this graph-based representation of the itinerary, our approach computes an informed spanning tree on a weighted graph. Each edge of the initial graph is weighted using a multi-criteria analysis approach combining qualitative and quantitative criteria. Criteria are based on information extracted from the text and information extracted from geographical sources. For instance, we compare information given in the text such as spatial relations describing orientation (e.g., going south) with the geographical coordinates of locations found in gazetteers. Finally, according to the definition of an itinerary and the information used in natural language to describe itineraries, we propose a markup langugage for encoding spatial and motion information based on the Text Encoding and Interchange guidelines (TEI) which defines a standard for the representation of texts in digital form. Additionally, the rationale of the proposed approach has been verified with a set of experiments on a corpus of multilingual hiking descriptions (French, Spanish and Italian).
Complete list of metadata

Cited literature [247 references]  Display  Hide  Download
Contributor : Ludovic Moncla Connect in order to contact the contributor
Submitted on : Monday, January 4, 2016 - 11:17:49 AM
Last modification on : Tuesday, May 4, 2021 - 2:54:18 PM


  • HAL Id : tel-01249999, version 1



Ludovic Moncla. Automatic reconstruction of itineraries from descriptive texts. Information Retrieval [cs.IR]. Université de Pau et des Pays de l'Adour; Universidad de Zaragoza, 2015. English. ⟨tel-01249999⟩



Les métriques sont temporairement indisponibles