Hypergraphs and information fusion for term representation enrichment : applications to named entity recognition and word sense disambiguation

Abstract : Making sense of textual data is an essential requirement for making computers understand our language. To extract actionable information from text, we need to represent it by means of descriptors before applying knowledge discovery techniques. The goal of this thesis is to shed light on heterogeneous representations of words and on how to leverage them while addressing their implicitly sparse nature. First, we propose a hypergraph network model that holds heterogeneous linguistic data in a single unified model. In other words, we introduce a model that represents words by means of different linguistic properties and links them together according to those properties. Our proposal differs from other types of linguistic networks in that we aim to provide a general structure that can hold several types of descriptive text features, instead of a single one as in most representations. This representation may be used to analyze the inherent properties of language from different points of view, or to serve as the starting point of an applied NLP task pipeline. Secondly, we employ feature fusion techniques to provide a single, final enriched representation that exploits the heterogeneous nature of the model and alleviates the sparseness of each representation. Techniques of this kind are usually applied only to combining multimedia data. In our approach, we consider different text representations as distinct sources of information that can be enriched by one another. This approach has not been explored before, to the best of our knowledge. Thirdly, we propose an algorithm that identifies and groups semantically related words by exploiting the real-world properties of the network.
In contrast with similar methods that are also based on the structure of the network, our algorithm reduces the number of required parameters and, more importantly, allows the use of either lexical or syntactic networks to discover these groups of words, instead of the single type of features usually employed. We focus on two different natural language processing tasks: Word Sense Induction and Disambiguation (WSI/WSD), and Named Entity Recognition (NER). In total, we test our proposals on four different open-access datasets. The results show the pertinence of our contributions and also give us some insight into the properties of heterogeneous features and their combination with fusion methods. Specifically, our experiments are twofold. First, we show that by using fusion-enriched heterogeneous features drawn from our proposed linguistic network, we outperform single-feature systems and other basic baselines. We note that using a single fusion operator is less effective than using a combination of them to obtain the final representation space. We show that the features added by each combined fusion operation are important for the models to predict the appropriate classes. We test the enriched representations on both the WSI/WSD and NER tasks. Secondly, we address the WSI/WSD task with our proposed network-based method. While it builds on previous work, our method achieves better overall performance and requires fewer parameters. We also discuss the use of either lexical or syntactic networks to solve the task. Finally, we parse a corpus based on the English Wikipedia and store it following the proposed network model. The parsed Wikipedia version serves as a linguistic resource to be used by other researchers.
Contrary to other similar resources, instead of storing only each word's part-of-speech tag and dependency relations, we also take into account the constituency-tree information of each word analyzed. We hope that this resource will be used in future developments without the need to compile such a resource from scratch.
Document type :
Theses

https://tel.archives-ouvertes.fr/tel-01940801
Contributor : Abes Star
Submitted on : Friday, November 30, 2018 - 3:15:06 PM
Last modification on : Saturday, December 1, 2018 - 1:22:58 AM
Long-term archiving on : Friday, March 1, 2019 - 2:37:44 PM

Files

resumefr_internet_soriano_mora...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01940801, version 1

Citation

Edmundo-Pavel Soriano-Morales. Hypergraphs and information fusion for term representation enrichment : applications to named entity recognition and word sense disambiguation. Computation and Language [cs.CL]. Université de Lyon, 2018. English. ⟨NNT : 2018LYSE2009⟩. ⟨tel-01940801⟩
