Combining Subword information and Language model for Information Retrieval

Information Retrieval (IR) classically relies on several processes to improve the performance of language modeling approaches. When considering the semantics of words, Neural Word Embeddings (Mikolov et al., 2013) have been shown to capture semantic similarities between words. Such distributed representations, which embed terms in a dense vector space, are efficiently learned from large corpora. Recently, they have been used to compute the translation probabilities between terms in the Neural Translation Language Model (NTLM) framework (Zuccon et al., 2015) for Information Retrieval, in order to deal with the vocabulary mismatch issue. In this work, we propose to test this model with recent vector representations (Bojanowski et al., 2016) that take into account the internal structure of words.
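
To make the idea of embedding-based translation probabilities concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. The toy vectors, the function names (`translation_probs`, `translated_term_prob`), the clipping of negative similarities, and the choice to normalize cosine similarities over the target word are all illustrative assumptions and may differ from the normalization used by Zuccon et al. (2015). The document model is a plain maximum-likelihood estimate without smoothing.

```python
import math

# Toy 2-d embeddings; the values are made up for illustration only.
EMBEDDINGS = {
    "car": [0.9, 0.1],
    "automobile": [0.8, 0.2],
    "banana": [0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def translation_probs(u, embeddings):
    """Return p_t(w | u) for every vocabulary word w.

    Assumption: similarities are clipped at zero and normalized over w,
    so that p_t(. | u) sums to 1.
    """
    sims = {w: max(cosine(embeddings[u], vec), 0.0)
            for w, vec in embeddings.items()}
    total = sum(sims.values())  # > 0 because cosine(u, u) == 1
    return {w: s / total for w, s in sims.items()}


def translated_term_prob(w, doc_terms, embeddings):
    """p(w | d) = sum over u in d of p_t(w | u) * p(u | d).

    p(u | d) is the maximum-likelihood estimate (term count / doc length).
    """
    n = len(doc_terms)
    prob = 0.0
    for u in set(doc_terms):
        p_u_d = doc_terms.count(u) / n
        prob += translation_probs(u, embeddings)[w] * p_u_d
    return prob


doc = ["automobile", "automobile", "banana"]
p_car = translated_term_prob("car", doc, EMBEDDINGS)
```

In this sketch the query term "car" receives a non-zero probability even though it never occurs in `doc`, because "automobile" translates to it with high probability; this is how a translation language model addresses vocabulary mismatch. Replacing the toy vectors with subword-aware embeddings would change only the contents of `EMBEDDINGS`.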

Keywords

Information Retrieval; Language Models; Distributed Word Representations

Domains

Computation and Language [cs.CL]; Information Retrieval [cs.IR]; Document and Text Processing

Main file

CORIA2018_Frej-et-al.pdf (238.5 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-01781181

Submitted on: Sunday, April 29, 2018, 17:07:30

Last modified on: Thursday, April 4, 2024, 21:26:56

Long-term archiving on: Thursday, September 20, 2018, 04:36:49

Dates and versions

hal-01781181, version 1 (29-04-2018)

Identifiers

HAL Id: hal-01781181, version 1

Cite

Jibril Frej, Philippe Mulhem, Didier Schwab, Jean-Pierre Chevallet. Combining Subword information and Language model for Information Retrieval. 15e Conférence en Recherche d’Information et Applications, May 2018, Rennes, France. ⟨hal-01781181⟩

Collections

UGA CNRS LIG LIG_TDCGE_GETALP LIG_TDCGE_MRIM LIG_SIDCH
