Preprint, Working Paper. Year: 2018

Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing

Abstract

Addressing the cross-lingual variation of grammatical structures and meaning categorization is a key challenge for multilingual Natural Language Processing. The lack of resources for the majority of the world's languages makes supervised learning not viable. Moreover, the performance of most algorithms is hampered by language-specific biases and the neglect of informative multilingual data. The discipline of Linguistic Typology provides a principled framework to compare languages systematically and empirically and documents their variation in publicly available databases. These enshrine crucial information to design language-independent algorithms and refine techniques devised to mitigate the above-mentioned issues, including cross-lingual transfer and multilingual joint models, with typological features. In this survey, we demonstrate that typology is beneficial to several NLP applications, involving both semantic and syntactic tasks. Moreover, we outline several techniques to extract features from databases or acquire them automatically: these features can be subsequently integrated into multilingual models to tie parameters together cross-lingually or gear a model towards a specific language. Finally, we advocate for a new typology that accounts for the patterns within individual examples rather than entire languages, and for graded categories rather than discrete ones, in order to bridge the gap with the contextual and continuous nature of machine learning algorithms.
Main file
1807.00914.pdf (2.81 MB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-02425462, version 1 (2018-08-09)
hal-02425462, version 2 (2020-10-24)

Identifiers

  • HAL Id: hal-02425462, version 1

Cite

Edoardo Maria Ponti, Helen O'Horan, Yevgeni Berzak, Ivan Vulic, Roi Reichart, et al. Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing. 2018. ⟨hal-02425462v1⟩
268 Views
964 Downloads