Conference paper, Year: 2013

Elephant: Sequence Labeling for Word and Sentence Segmentation

Abstract

Tokenization is widely regarded as a solved problem because rule-based tokenizers achieve high accuracy. But rule-based tokenizers are hard to maintain, and their rules are language-specific. We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.
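To make the idea of character-level sequence labeling concrete, here is a minimal sketch of how one label per character can encode both word and sentence boundaries. The abstract does not spell out a label set, so the four labels used here are an assumption for illustration: S (first character of a sentence-initial token), T (first character of any other token), I (inside a token) and O (outside any token). The function names `decode` and `per_mille_error` are hypothetical, and the error measure assumes errors are counted per character label, which the abstract does not state. The labels are written by hand; in the described method they would come from a trained sequence labeler.

```python
from typing import List


def decode(text: str, labels: str) -> List[List[str]]:
    """Turn one label per character into a list of sentences of tokens.

    Assumed label scheme (illustrative, not taken from the abstract):
    S = starts a sentence-initial token, T = starts another token,
    I = continues the current token, O = outside any token.
    """
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    sentences: List[List[str]] = []
    token_open = False  # True while the previous character belonged to a token
    for ch, lab in zip(text, labels):
        if lab == "S":
            sentences.append([ch])      # new sentence, new token
            token_open = True
        elif lab == "T":
            if sentences:
                sentences[-1].append(ch)  # new token in the current sentence
            else:
                sentences.append([ch])    # tolerate a missing initial S
            token_open = True
        elif lab == "I":
            if not token_open:
                raise ValueError("I label without an open token")
            sentences[-1][-1] += ch     # extend the current token
        elif lab == "O":
            token_open = False          # e.g. whitespace between tokens
        else:
            raise ValueError(f"unknown label: {lab!r}")
    return sentences


def per_mille_error(gold: str, predicted: str) -> float:
    """Share of mismatched character labels, in per mille (assumed measure)."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("label sequences must be non-empty and equally long")
    wrong = sum(g != p for g, p in zip(gold, predicted))
    return 1000.0 * wrong / len(gold)


if __name__ == "__main__":
    text = "It didn't. Go!"
    labels = "SIOTIITIITOSIT"  # hand-written labels for illustration
    print(decode(text, labels))
```

Because the boundaries live in the labels rather than in hand-written rules, a token split such as "did" / "n't" is simply a T label in the middle of a whitespace-delimited string, and moving to another language would mean retraining the labeler rather than rewriting rules.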
Main file
D13-1146.pdf (129.32 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01344500, version 1 (12-07-2016)

Identifiers

  • HAL Id: hal-01344500, version 1

Cite

Kilian Evang, Valerio Basile, Grzegorz Chrupała, Johan Bos. Elephant: Sequence Labeling for Word and Sentence Segmentation. EMNLP 2013, Oct 2013, Seattle, United States. ⟨hal-01344500⟩
