Theses

Extraction et Complétion de Terminologies Multilingues (Extraction and Completion of Multilingual Terminologies)

Valérie Hanoka 1
1 ALPAGE - Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing
Inria Paris-Rocquencourt, UPD7 - Université Paris Diderot - Paris 7
Abstract : Until now, automatic terminology extraction techniques have often targeted monolingual corpora that are homogeneous in language register. This work, carried out under a CIFRE agreement, extends that objective to unedited textual data written in typologically diverse languages, in order to extract "field terms". It focuses on the analysis of verbatim comments produced in employee surveys conducted within multinational companies and processed by the company Verbatim Analysis - VERA. It involves the design and development of a processing pipeline that extracts terminologies automatically in a virtually language-independent, register-independent and domain-independent way. Based on an assessment of the typological properties of seven diverse languages, we propose a preliminary text pre-processing step that prepares the data for model training. This step is partly necessary (tokenization) and partly optional (removal of part of the morphological information). From the resulting data we compute a series of numerical features (statistical and frequency-based) that are used to train statistical models (CRFs). We select a first set of best models through a dedicated automatic evaluation of the terms extracted in each of the experimental settings considered for each language. We then carry out a second series of evaluations to assess how usable these models are on languages other than their training languages. Our results tend to show that the quality of the extracted field terms is satisfactory. The best scores we obtain (in a monolingual setting) are above 0.9 for most languages. For several languages these scores can be improved further by using some of the best models trained on other languages; as a result, our approach could prove useful for extracting terminologies in languages for which such models are not available.
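
As a rough illustration of the pre-processing and feature steps the abstract describes, the sketch below tokenizes a few survey comments, optionally discards part of each token, and computes simple frequency-based features per token. The function names, the prefix-truncation rule used as a stand-in for "removal of part of the morphological information", and the three features shown are all assumptions made for this example; they are not taken from the thesis, and the CRF training step is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def truncate(token, keep=5):
    """Hypothetical morphology reduction: keep only the first `keep` characters."""
    return token[:keep]


def token_features(documents, strip_morphology=False):
    """Return, for each document, one feature dict per token."""
    tokenized = [tokenize(doc) for doc in documents]
    if strip_morphology:
        tokenized = [[truncate(tok) for tok in doc] for doc in tokenized]

    # Corpus-level counts: occurrences per token, and documents containing it.
    term_freq = Counter(tok for doc in tokenized for tok in doc)
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    total_tokens = sum(term_freq.values())
    n_docs = len(tokenized)

    return [
        [
            {
                "token": tok,
                "rel_freq": term_freq[tok] / total_tokens,
                "doc_ratio": doc_freq[tok] / n_docs,
                "length": len(tok),
            }
            for tok in doc
        ]
        for doc in tokenized
    ]


if __name__ == "__main__":
    comments = ["Working hours are too long", "Long working hours"]
    for sequence in token_features(comments):
        print(sequence)
```

Feature sequences of this shape (one dictionary per token, one list per comment) are a common input format for sequence labellers such as conditional random fields; any actual feature set would have to come from the thesis itself.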

Cited literature : 321 references

https://hal.archives-ouvertes.fr/tel-01257201
Contributor : Valérie Hanoka
Submitted on : Friday, January 15, 2016 - 9:59:03 PM
Last modification on : Friday, March 27, 2020 - 2:58:09 AM

Identifiers

  • HAL Id : tel-01257201, version 1

Citation

Valérie Hanoka. Extraction et Complétion de Terminologies Multilingues. Linguistique. Université Paris Diderot (Paris 7), 2015. Français. ⟨tel-01257201⟩

Metrics

Record views : 314

File downloads : 962