Conference paper. Year: 2016

An Unsupervised Morphological Criterion for Discriminating Similar Languages

Abstract

In this study, conducted on the occasion of the Discriminating between Similar Languages (DSL) shared task, I introduce an additional decision factor focusing on the token and subtoken level. The motivation behind this submission is to test whether a morphologically informed criterion can add linguistically relevant information to global categorization and thus improve performance. The contributions of this paper are (1) a description of the unsupervised, low-resource method; (2) an evaluation and analysis of its raw performance; and (3) an assessment of its impact within a model comprising common indicators used in language identification. I present and discuss the systems used in task A, a 12-way language identification task comprising varieties of five main language groups. Additionally, I introduce a new off-the-shelf Naive Bayes classifier using a contrastive word and subword n-gram model ("Bayesline"), which outperforms the best submissions.
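To make the idea of a Naive Bayes classifier over word and subword n-grams more concrete, here is a minimal sketch in plain Python. It is not the Bayesline system: the abstract does not specify its features, smoothing, or what "contrastive" means in practice, so the feature extractor `extract_features`, the class `NaiveBayesLanguageID`, the n-gram sizes, the add-one smoothing, and the four toy training sentences are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def extract_features(text, char_n=(2, 3, 4)):
    """Return word unigrams plus character n-grams taken inside each word.

    The n-gram sizes are an illustrative assumption, not the paper's setup.
    """
    features = []
    for word in text.lower().split():
        features.append("w:" + word)
        padded = f"<{word}>"  # mark word boundaries for subword n-grams
        for n in char_n:
            for i in range(len(padded) - n + 1):
                features.append(f"c{n}:" + padded[i:i + n])
    return features


class NaiveBayesLanguageID:
    """Multinomial Naive Bayes with add-alpha smoothing (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.feature_counts = defaultdict(Counter)  # label -> feature counts
        self.doc_counts = Counter()                 # label -> number of texts
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = extract_features(text)
            self.doc_counts[label] += 1
            self.feature_counts[label].update(feats)
            self.vocabulary.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        feats = [f for f in extract_features(text) if f in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # Log prior plus summed log likelihoods of the known features.
            score = math.log(n_docs / total_docs)
            for feat in feats:
                score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for this sketch, far too small for real use).
train_texts = [
    "el gato come pescado",
    "la casa es muy grande",
    "o gato come peixe",
    "a casa é muito grande",
]
train_labels = ["es", "es", "pt", "pt"]

model = NaiveBayesLanguageID().fit(train_texts, train_labels)
print(model.predict("o peixe é grande"))
print(model.predict("el pescado es grande"))
```

Combining whole-word features with character n-grams lets the classifier fall back on subword evidence (prefixes, suffixes, typical letter sequences) when a word was never seen in training, which is one plausible reason to pair the two feature types when the languages to separate are closely related.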
Main file
ABarbaresi_Morphological-Criterion_DSL16.pdf (140.8 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-01575653, version 1 (21-08-2017)

License

Attribution

Identifiers

  • HAL Id: hal-01575653, version 1

Cite

Adrien Barbaresi. An Unsupervised Morphological Criterion for Discriminating Similar Languages. 3rd Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2016), Dec 2016, Osaka, Japan. pp. 212-220. ⟨hal-01575653⟩
