Predicative Analysis for Information Extraction: application to the biology domain

Zorana Ratkovic

Thesis. Year: 2014

French title: Analyse prédicative pour l’extraction d’information : application au domaine de la biologie

Author: Zorana Ratkovic (1, 2, 3)

1 Unité Mathématique Informatique et Génome
2 Lattice - Langues, Textes, Traitements informatiques, Cognition - UMR 8094
3 Université Sorbonne Nouvelle - Paris 3

Abstract

The abundance of biomedical information expressed in natural language has created a need for methods that process this information automatically. In the field of Natural Language Processing (NLP), Information Extraction (IE) focuses on extracting relevant information from unstructured natural-language data. Many IE methods today rely on Machine Learning (ML) approaches that use deep linguistic processing to capture the complex information contained in biomedical texts. In particular, syntactic analysis and parsing have played an important role in IE by helping capture how the words in a sentence are related. This thesis examines how dependency parsing can be used to facilitate IE. It takes a task-based approach to dependency parsing evaluation and parser selection, including a detailed error analysis. To achieve high-quality syntax-based IE, different stages of linguistic processing are addressed, including both pre-processing steps (such as tokenization) and complementary linguistic processing (such as semantics and coreference analysis). This thesis also explores how the different levels of linguistic processing can be represented for use within an ML-based IE algorithm, and why the interface between the two is of great importance. Finally, biomedical data is very heterogeneous, encompassing different subdomains and genres. This thesis explores how subdomain adaptation can be achieved by using existing subdomain knowledge and resources. The methods and approaches described are explored using two different biomedical corpora, demonstrating how the IE results are used in real-life tasks.

This thesis falls within the context described previously: it explores techniques for acquiring lexical knowledge from texts, for both theoretical and applied purposes. The analysis focuses in particular on the verbal predicate and its nominalizations, since the predicate plays an essential role in NLP applications (event detection, information extraction, etc.). One point of interest, for example, is the acquisition of subcategorization frames and selectional restrictions in order to identify families of verbs with similar syntactic and semantic behavior. The envisaged strategy is strongly inspired by the work of Z. Harris and his colleagues (Harris 1951, 1988; Harris et al., 1989). Harris showed that technical texts do not use the full complexity of the language but instead make use of "sublanguages". A sublanguage has a specialized vocabulary and a simplified syntax compared with everyday language. Specialized texts therefore exhibit regularities that can be studied through distributional analysis (to simplify: elements that appear in similar contexts have similar, or at least close, meanings). However, distributional analysis can only work if the text has been "cleaned" of surface linguistic variation. A pre-analysis of the texts is therefore crucial.

Keywords

BioNLP; information extraction; relation extraction; dependency parsing; natural language processing

Domains

Artificial Intelligence [cs.AI]; Document and Text Processing; Bioinformatics [q-bio.QM]; Machine Learning [cs.LG]; Human-Computer Interaction [cs.HC]; Ubiquitous Computing; Computer Science and Game Theory [cs.GT]

Contributor: Migration ProdInra

https://hal.inrae.fr/tel-02796506

Submitted on: Friday, June 5, 2020, 13:45:36

Last modified on: Friday, April 19, 2024, 16:18:57

Dates and versions

tel-02796506, version 1 (05-06-2020)

Identifiers

HAL Id: tel-02796506, version 1
PRODINRA : 279517

Cite

Zorana Ratkovic. Predicative Analysis for Information Extraction: application to the biology domain. Artificial Intelligence [cs.AI]. 2014. English. ⟨NNT : ⟩. ⟨tel-02796506⟩


Collections

ENS-PARIS CNRS UNIV-PARIS3 INRA LATTICE THESES-ENS PSL INRAE MATHNUM

