Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus
Abstract
This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2.1 corpus (UD2) (Nivre et al., 2017). A software tool, which relies on Elephant (Evang et al., 2013) to perform the training, is also made available. Beyond providing the community with a large choice of language-specific tokenizers, we argue in this paper that: (1) tokenization should be treated as a supervised task; (2) scaling to many languages requires a streamlined software engineering process.
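To make the "tokenization as a supervised task" framing concrete, the sketch below shows one plausible way to turn a UD CoNLL-U sentence into per-character training pairs, where each character of the raw text receives a segmentation label. This is an illustrative assumption, not the paper's released tool: the label scheme (T = token-initial, I = token-internal, O = outside any token) and the function `char_labels` are hypothetical, and the parsing assumes each sentence block carries a `# text =` comment, as UD 2.1 treebanks do.

```python
def char_labels(conllu_sentence: str):
    """Turn one CoNLL-U sentence block into (char, label) training pairs."""
    text = None
    tokens = []
    for line in conllu_sentence.splitlines():
        if line.startswith("# text = "):
            text = line[len("# text = "):]
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            # Skip multiword-token ranges like "3-4" and empty nodes like "5.1".
            if fields[0].isdigit():
                tokens.append(fields[1])
    if text is None:
        raise ValueError("sentence block lacks a '# text =' comment")

    labels = ["O"] * len(text)
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)   # align each token to the raw text
        labels[start] = "T"            # token-initial character
        for i in range(start + 1, start + len(tok)):
            labels[i] = "I"            # token-internal characters
        pos = start + len(tok)
    return list(zip(text, labels))

# Toy CoNLL-U fragment (FORM column only, for brevity):
example = "# text = Hello, world!\n1\tHello\n2\t,\n3\tworld\n4\t!"
for ch, lab in char_labels(example):
    print(f"{ch!r}\t{lab}")
```

Pairs of this form are exactly what a character-level sequence labeller such as Elephant can be trained on; the output tokenizer then reduces word segmentation to predicting these labels on unseen text.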
Origin: Publisher files authorized on an open archive