
Apprentissage de plongements de mots sur des corpus en langue de spécialité : une étude d’impact

Abstract: Word embedding approaches are state of the art in Natural Language Processing (NLP). In this work, we focus on learning word embeddings for small domain-specific corpora. In particular, we would like to know whether word embeddings learnt on large general-domain corpora such as Wikipedia perform better than word embeddings learnt on domain-specific corpora. To answer this question, we consider two corpora: OHSUMED, from the medical field, and SNCF, a technical documentation corpus. After presenting the corpora and evaluating their specificity, we introduce a classification task. We use word embeddings learnt either on the domain-specific corpora or on Wikipedia as input for this task. Our analysis demonstrates that word embeddings learnt on Wikipedia achieve excellent results, even though, in the case of OHSUMED, domain-specific word embeddings perform better.
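
The record gives no implementation details, but the setup described in the abstract (embeddings trained on a small domain-specific corpus, then used as features for a classification task) can be sketched roughly as follows. The gensim Word2Vec model, the averaged-vector document representation, the logistic-regression classifier, and all hyperparameters are assumptions for illustration, not the authors' exact pipeline.

# Minimal sketch (not the authors' pipeline): train word embeddings on a small
# domain-specific corpus, then use averaged word vectors as document features
# for a classification task. Corpus and labels below are toy placeholders.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy domain-specific corpus: each document is a list of tokens.
docs = [
    ["patient", "presented", "with", "acute", "renal", "failure"],
    ["signal", "relay", "failure", "detected", "on", "track", "circuit"],
    ["chronic", "renal", "disease", "managed", "with", "dialysis"],
    ["track", "maintenance", "scheduled", "after", "relay", "inspection"],
]
labels = [0, 1, 0, 1]  # e.g. medical vs. technical documentation

# Learn embeddings on the domain corpus (hyperparameters are illustrative).
model = Word2Vec(sentences=docs, vector_size=50, window=5, min_count=1, epochs=50)

def doc_vector(tokens, wv):
    # Average the word vectors of a document (zeros if no token is in the vocabulary).
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

X = np.stack([doc_vector(d, model.wv) for d in docs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.score(X, labels))

To compare domain-specific and general-domain embeddings in this sketch, one would simply swap the trained model for pre-trained Wikipedia vectors and keep the classifier unchanged.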
Document type: Conference papers

https://hal.archives-ouvertes.fr/hal-02786198
Contributor: Sylvain Pogodalla
Submitted on: Tuesday, February 2, 2021 - 3:22:08 PM
Last modification on: Wednesday, April 20, 2022 - 11:36:01 AM

Files

155.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id: hal-02786198, version 3

Citation

Valentin Pelloin, Thibault Prouteau. Apprentissage de plongements de mots sur des corpus en langue de spécialité : une étude d’impact. 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 3 : Rencontre des Étudiants Chercheurs en Informatique pour le TAL, Jun 2020, Nancy, France. pp.164-178. ⟨hal-02786198v3⟩

Metrics

Record views: 372
File downloads: 198