Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus
Abstract
This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2.1 corpus (UD2) (Nivre et al., 2017). A software tool, which relies on Elephant (Evang et al., 2013) to perform the training, is also made available. Beyond providing the community with a large choice of language-specific tokenizers, we argue in this paper that: (1) tokenization should be treated as a supervised task; (2) scaling to many languages requires a streamlined software engineering process.
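To make the "tokenization as a supervised task" framing concrete, the sketch below shows one plausible way to turn a UD CoNLL-U sentence into per-character training pairs, where each character of the raw text receives a segmentation label. This is an illustrative assumption, not the paper's released tool: the label scheme (T = token-initial, I = token-internal, O = outside any token) and the function `char_labels` are hypothetical, and the parsing assumes each sentence block carries a `# text =` comment, as UD 2.1 treebanks do.

```python
def char_labels(conllu_sentence: str):
    """Turn one CoNLL-U sentence block into (char, label) training pairs."""
    text = None
    tokens = []
    for line in conllu_sentence.splitlines():
        if line.startswith("# text = "):
            text = line[len("# text = "):]
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            # Skip multiword-token ranges like "3-4" and empty nodes like "5.1".
            if fields[0].isdigit():
                tokens.append(fields[1])
    if text is None:
        raise ValueError("sentence block lacks a '# text =' comment")

    labels = ["O"] * len(text)
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)   # align each token to the raw text
        labels[start] = "T"            # token-initial character
        for i in range(start + 1, start + len(tok)):
            labels[i] = "I"            # token-internal characters
        pos = start + len(tok)
    return list(zip(text, labels))

# Toy CoNLL-U fragment (FORM column only, for brevity):
example = "# text = Hello, world!\n1\tHello\n2\t,\n3\tworld\n4\t!"
for ch, lab in char_labels(example):
    print(f"{ch!r}\t{lab}")
```

Pairs of this form are exactly what a character-level sequence labeller such as Elephant can be trained on; the output tokenizer then reduces word segmentation to predicting these labels on unseen text.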
Origin: Publisher files authorized on an open archive