An Unsupervised Morphological Criterion for Discriminating Similar Languages

Adrien Barbaresi

Conference paper, Year: 2016

Role: Author
PersonId: 1134
IdHAL: adrien-barbaresi
ORCID: 0000-0002-8079-8694

Berlin-Brandenburgische Akademie der Wissenschaften

Austrian Academy of Sciences

Abstract

In this study, conducted on the occasion of the Discriminating between Similar Languages shared task, I introduce an additional decision factor that operates at the token and subtoken level. The motivation behind this submission is to test whether a morphologically informed criterion can add linguistically relevant information to global categorization and thus improve performance. The contributions of this paper are (1) a description of the unsupervised, low-resource method; (2) an evaluation and analysis of its raw performance; and (3) an assessment of its impact within a model comprising common indicators used in language identification. I present and discuss the systems used in task A, a 12-way language identification task comprising varieties of five main language groups. Additionally, I introduce a new off-the-shelf Naive Bayes classifier using a contrastive word and subword n-gram model ("Bayesline"), which outperforms the best submissions.
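
As an illustration of the kind of model named in the abstract, the following is a minimal sketch of a Naive Bayes classifier over word and subword (character) n-gram counts. It is not the Bayesline implementation from the paper: the class name `NaiveBayesLanguageId`, the helper `features`, the n-gram range, the add-one smoothing, and the toy Spanish and Portuguese sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def features(text, n_min=2, n_max=4):
    """Yield word unigrams plus character n-grams from each padded word.

    The feature design is hypothetical, not taken from the paper.
    """
    for word in text.lower().split():
        yield "w:" + word
        padded = f"_{word}_"  # mark word boundaries so affixes stay distinct
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield "c:" + padded[i:i + n]


class NaiveBayesLanguageId:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.counts = defaultdict(Counter)  # label -> feature counts
        self.docs = Counter()               # label -> number of training texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.docs[label] += 1
            for feat in features(text):
                self.counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        feats = list(features(text))
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label in self.docs:
            total = sum(self.counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.docs[label] / total_docs)
            for feat in feats:
                score += math.log((self.counts[label][feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example only.
train_texts = [
    "el gato come pescado en la casa",
    "la niña lee un libro en la escuela",
    "o gato come peixe na casa",
    "a menina lê um livro na escola",
]
train_labels = ["es", "es", "pt", "pt"]

model = NaiveBayesLanguageId().fit(train_texts, train_labels)
print(model.predict("el perro duerme en la calle"))
```

Padding each word with boundary markers lets prefixes and suffixes count as separate features, which is one simple way for a subword model to pick up morphological cues between closely related varieties.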

Keywords

Language identification; Language detection; Morphological analysis; Dialectal differences

Domains

Linguistics; Computation and Language [cs.CL]

Main file

ABarbaresi_Morphological-Criterion_DSL16.pdf (140.8 KB)

Origin: Publisher files allowed on an open archive

https://hal.science/hal-01575653

Submitted on: Monday, August 21, 2017, 13:41:06

Last modified on: Wednesday, December 12, 2018, 13:32:04

Dates and versions

hal-01575653, version 1 (21-08-2017)

License

Attribution

Identifiers

HAL Id: hal-01575653, version 1

Cite

Adrien Barbaresi. An Unsupervised Morphological Criterion for Discriminating Similar Languages. 3rd Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2016), Dec 2016, Osaka, Japan. pp. 212-220. ⟨hal-01575653⟩
