Conference papers

Unwritten Languages Demand Attention Too! Word Discovery with Encoder-Decoder Models

Abstract: Word discovery is the task of extracting words from unsegmented text. In this paper we examine to what extent neural networks can be applied to this task in a realistic unwritten language scenario, where only small corpora and limited annotations are available. We investigate two scenarios: one with no supervision, and one with limited supervision in which the most frequent words are known. Our results show that it is possible to retrieve at least 27% of the gold standard vocabulary by training an encoder-decoder neural machine translation system on only 5,157 sentences. This result is close to those obtained with a task-specific Bayesian nonparametric model. Moreover, our approach has the advantage of generating translation alignments, which could be used to create a bilingual lexicon. As a future perspective, this approach is also well suited to working directly from speech.

Cited literature: 30 references

https://hal.archives-ouvertes.fr/hal-01592091
Contributor: Laurent Besacier
Submitted on: Friday, September 22, 2017 - 3:42:59 PM
Last modification on: Monday, April 20, 2020 - 10:40:03 AM
Document(s) archived on: Saturday, December 23, 2017 - 2:02:32 PM

File

Template.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01592091, version 1

Citation

Marcely Zanon Boito, Alexandre Bérard, Aline Villavicencio, Laurent Besacier. Unwritten Languages Demand Attention Too! Word Discovery with Encoder-Decoder Models. IEEE Automatic Speech Recognition and Understanding (ASRU), Dec 2017, Okinawa, Japan. ⟨hal-01592091⟩

Metrics

Record views: 477

File downloads: 256