Unwritten Languages Demand Attention Too! Word Discovery with Encoder-Decoder Models

Abstract: Word discovery is the task of extracting words from unsegmented text. In this paper we examine to what extent neural networks can be applied to this task in a realistic unwritten language scenario, where only small corpora and limited annotations are available. We investigate two scenarios: one with no supervision and another with limited supervision, with access to the most frequent words. The results show that it is possible to retrieve at least 27% of the gold standard vocabulary by training an encoder-decoder neural machine translation system with only 5,157 sentences. This result is close to those obtained with a task-specific Bayesian nonparametric model. Moreover, our approach has the advantage of generating translation alignments, which could be used to create a bilingual lexicon. As a future perspective, this approach is also well suited to work directly from speech.
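The alignment-based word discovery the abstract describes can be sketched roughly as follows: each source symbol (e.g. a character or phone) is assigned to the target word whose attention weight on it is highest, and a word boundary is placed wherever that assignment changes. This is a minimal illustrative sketch, not the authors' implementation; the function name and the toy attention matrix are hypothetical.

```python
# Hypothetical sketch: segmenting an unsegmented symbol sequence using
# soft-attention weights from an encoder-decoder translation model.
# attention[t][s] = weight that target word t puts on source symbol s.

def segment_from_attention(source_symbols, attention):
    # Assign each source symbol to the target word attending to it most.
    assignments = [
        max(range(len(attention)), key=lambda t: attention[t][s])
        for s in range(len(source_symbols))
    ]
    # Insert a word boundary wherever the assignment changes.
    words, current = [], [source_symbols[0]]
    for sym, prev, cur in zip(source_symbols[1:], assignments, assignments[1:]):
        if cur != prev:
            words.append("".join(current))
            current = [sym]
        else:
            current.append(sym)
    words.append("".join(current))
    return words

# Toy example: two target words, four source symbols.
attn = [
    [0.9, 0.8, 0.1, 0.1],  # target word 0 attends to symbols 0-1
    [0.1, 0.2, 0.9, 0.9],  # target word 1 attends to symbols 2-3
]
print(segment_from_attention(list("abcd"), attn))  # -> ['ab', 'cd']
```

Because every assignment decision also links a discovered word to a target word, the same alignments can seed the bilingual lexicon mentioned in the abstract.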
Document type :
Conference papers
Cited literature: 30 references
Contributor: Laurent Besacier
Submitted on: Friday, September 22, 2017 - 3:42:59 PM
Last modification on: Tuesday, May 11, 2021 - 11:37:20 AM
Long-term archiving on: Saturday, December 23, 2017 - 2:02:32 PM
Files produced by the author(s)
  • HAL Id: hal-01592091, version 1

Marcely Zanon Boito, Alexandre Bérard, Aline Villavicencio, Laurent Besacier. Unwritten Languages Demand Attention Too! Word Discovery with Encoder-Decoder Models. IEEE Automatic Speech Recognition and Understanding (ASRU), Dec 2017, Okinawa, Japan. ⟨hal-01592091⟩