Elephant: Sequence Labeling for Word and Sentence Segmentation

Kilian Evang; Valerio Basile; Grzegorz Chrupała; Johan Bos

Conference paper. Year: 2013

Kilian Evang

Role: Author

University of Groningen [Groningen]

Valerio Basile

Role: Author

University of Groningen [Groningen]

Grzegorz Chrupała

Role: Author

Tilburg University [Netherlands]

Johan Bos

Role: Author

University of Groningen [Groningen]

Abstract

Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. However, rule-based tokenizers are hard to maintain, and their rules are language-specific. We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.
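
To make "sequence labeling on the character level" concrete, here is a minimal sketch of how gold-standard token and sentence boundaries could be turned into one label per character, and read back again. The four-label set used here (S, T, I, O) and the helper names `label_characters` and `decode` are illustrative assumptions; the abstract itself does not specify the label inventory, the classifier, or the features, and none of those are modeled here.

```python
from typing import List, Tuple

# Hypothetical label set (an assumption, not stated in the abstract):
#   S = first character of the first token of a sentence
#   T = first character of any other token
#   I = character inside a token
#   O = character outside any token (e.g. whitespace)

def label_characters(text: str,
                     sentences: List[List[Tuple[int, int]]]) -> List[str]:
    """Return one label per character of `text`.

    `sentences` is a list of sentences; each sentence is a list of
    (start, end) character offsets of its tokens, with `end` exclusive.
    """
    labels = ["O"] * len(text)
    for sentence in sentences:
        for position, (start, end) in enumerate(sentence):
            labels[start] = "S" if position == 0 else "T"
            for i in range(start + 1, end):
                labels[i] = "I"
    return labels

def decode(text: str, labels: List[str]) -> List[List[str]]:
    """Rebuild sentences and tokens from a well-formed label sequence."""
    sentences: List[List[str]] = []
    for ch, label in zip(text, labels):
        if label == "S":
            sentences.append([ch])          # new sentence, new token
        elif label == "T":
            if not sentences:
                sentences.append([])
            sentences[-1].append(ch)        # new token, same sentence
        elif label == "I" and sentences and sentences[-1]:
            sentences[-1][-1] += ch         # extend the current token
        # "O" characters are dropped
    return sentences

text = "It didn't matter. OK."
gold = [
    [(0, 2), (3, 6), (6, 9), (10, 16), (16, 17)],  # It | did | n't | matter | .
    [(18, 20), (20, 21)],                          # OK | .
]
labels = label_characters(text, gold)
print("".join(labels))
print(decode(text, labels))
```

In a setup like this, a supervised sequence labeler would be trained to predict the label string from the characters, and segmentation of new text would amount to running `decode` on its predictions. Note that "didn't" is split into two tokens without any intervening whitespace, which is the kind of case a purely whitespace-based splitter cannot express.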

Keywords

Tokenization, Sequence labeling, Machine learning, Deep learning, Segmentation

Domains

Linguistics; Computation and Language [cs.CL]; Document and Text Processing

Main file

D13-1146.pdf (129.32 KB)

Origin: Files produced by the author(s)

Contributor: Valerio Basile

https://hal.science/hal-01344500

Submitted on: Tuesday, 12 July 2016, 09:39:49

Last modified on: Monday, 9 October 2017, 13:18:03

Dates and versions

hal-01344500, version 1 (12-07-2016)

Identifiers

HAL Id: hal-01344500, version 1

Cite

Kilian Evang, Valerio Basile, Grzegorz Chrupała, Johan Bos. Elephant: Sequence Labeling for Word and Sentence Segmentation. EMNLP 2013, Oct 2013, Seattle, United States. ⟨hal-01344500⟩
