Unwritten Languages Demand Attention Too! Word Discovery with Encoder-Decoder Models

Abstract: Word discovery is the task of extracting words from unsegmented text. In this paper we examine to what extent neural networks can be applied to this task in a realistic unwritten-language scenario, where only small corpora and limited annotations are available. We investigate two scenarios: one with no supervision, and another with limited supervision through access to the most frequent words. The results show that it is possible to retrieve at least 27% of the gold-standard vocabulary by training an encoder-decoder neural machine translation system on only 5,157 sentences. This result is close to those obtained with a task-specific Bayesian nonparametric model. Moreover, our approach has the advantage of generating translation alignments, which could be used to create a bilingual lexicon. As a future perspective, this approach is also well suited to working directly from speech.
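To illustrate the alignment-based idea described in the abstract, the sketch below shows one way soft-attention weights from a trained encoder-decoder model could be turned into word boundaries over an unsegmented symbol sequence: each source symbol is assigned to the target word that attends to it most strongly, and runs of symbols sharing the same target word become candidate words. The `discover_words` helper and the toy attention matrix are hypothetical illustrations, not the authors' exact pipeline.

```python
import numpy as np

def discover_words(attention, source_symbols):
    """Propose a segmentation of an unsegmented symbol sequence from attention weights.

    attention: (num_target_words, num_source_symbols) matrix of attention
        probabilities produced by a trained encoder-decoder NMT model.
    source_symbols: list of unsegmented source symbols (e.g., phonemes or characters).
    """
    # Assign each source symbol to the target word with the highest attention weight.
    assignments = attention.argmax(axis=0)
    # Group consecutive symbols aligned to the same target word into candidate words.
    segments, current = [], [source_symbols[0]]
    for i in range(1, len(source_symbols)):
        if assignments[i] == assignments[i - 1]:
            current.append(source_symbols[i])
        else:
            segments.append("".join(current))
            current = [source_symbols[i]]
    segments.append("".join(current))
    return segments

# Toy example: 2 target words attending over 5 unsegmented source symbols.
attention = np.array([[0.8, 0.7, 0.1, 0.1, 0.2],
                      [0.2, 0.3, 0.9, 0.9, 0.8]])
print(discover_words(attention, list("abcde")))  # ['ab', 'cde']
```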
Document type:
Conference paper
IEEE Automatic Speech Recognition and Understanding (ASRU), Dec 2017, Okinawa, Japan

https://hal.archives-ouvertes.fr/hal-01592091
Contributor: Laurent Besacier
Submitted on: Friday, 22 September 2017 - 15:42:59
Last modified on: Thursday, 11 October 2018 - 08:48:03
Long-term archiving on: Saturday, 23 December 2017 - 14:02:32

File

Template.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01592091, version 1

Citation

Marcely Zanon Boito, Alexandre Bérard, Aline Villavicencio, Laurent Besacier. Unwritten Languages Demand Attention Too! Word Discovery with Encoder-Decoder Models. IEEE Automatic Speech Recognition and Understanding (ASRU), Dec 2017, Okinawa, Japan. 〈hal-01592091〉
