A. Alishahi, M. Barking, and G. Chrupała, Encoding of phonology in a recurrent neural model of grounded speech, Proceedings of the 21st Conference on Computational Natural Language Learning, pp.368-378, 2017.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, 3rd International Conference on Learning Representations, 2015.

N. Carlini and D. A. Wagner, Audio adversarial examples: Targeted attacks on speech-to-text, IEEE Security and Privacy Workshops (SPW), pp.1-7, 2018.

K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.1724-1734, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01433235

G. Chrupała, L. Gelderloos, and A. Alishahi, Representations of language in a model of visually grounded speech signal, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol.1, pp.613-622, 2017.

G. Chrupała, L. Gelderloos, and A. Alishahi, Synthetically spoken COCO, 2017.

S. Cotton and F. Grosjean, The gating paradigm: A comparison of successive and individual presentation formats, Perception & Psychophysics, vol.35, issue.1, pp.41-48, 1984.

D. Dahan and J. S. Magnuson, Spoken word recognition, in M. J. Traxler and M. A. Gernsbacher (eds.), Handbook of Psycholinguistics, chapter 8, pp.249-283, 2006.

E. Dupoux, Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner, Cognition, vol.173, pp.43-59, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01888694

F. Grosjean, Spoken word recognition processes and the gating paradigm, Perception & Psychophysics, vol.28, issue.4, pp.267-283, 1980.

D. Harwath and J. Glass, Towards visually grounded sub-word speech unit discovery, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.3017-3021, 2019.

D. Harwath, A. Recasens, D. Surís, G. Chuang, A. Torralba et al., Jointly discovering visual objects and spoken words from raw sensory input, Computer Vision - ECCV 2018, pp.659-677, 2018.

D. F. Harwath, A. Torralba, and J. R. Glass, Unsupervised learning of spoken language with visual context, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pp.1858-1866, 2016.

W. N. Havard, J. Chevrot, and L. Besacier, Models of visually grounded speech signal pay attention to nouns: A bilingual experiment on English and Japanese, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.8618-8622, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02013984

A. Kádár, G. Chrupała, and A. Alishahi, Representation of linguistic form and function in recurrent neural networks, Computational Linguistics, vol.43, issue.4, pp.761-780, 2017.

H. Kamper, G. Shakhnarovich, and K. Livescu, Semantic speech retrieval with a visually grounded model of untranscribed speech, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.27, pp.89-98, 2019.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common objects in context, in T. Pajdla, B. Schiele, and T. Tuytelaars (eds.), Computer Vision - ECCV 2014, pp.740-755, 2014.

W. D. Marslen-Wilson, Functional parallelism in spoken word-recognition, Cognition (Special Issue: Spoken Word Recognition), vol.25, pp.71-102, 1987.

W. D. Marslen-Wilson and A. Welsh, Processing interactions and lexical access during word recognition in continuous speech, Cognitive Psychology, vol.10, issue.1, pp.29-63, 1978.

J. L. McClelland and J. L. Elman, The TRACE model of speech perception, Cognitive Psychology, vol.18, issue.1, pp.1-86, 1986.

D. Merkx, S. L. Frank, and M. Ernestus, Language learning using speech to image retrieval, Proc. Interspeech, pp.1841-1845, 2019.

D. Norris, Shortlist: a connectionist model of continuous speech recognition, Cognition, vol.52, pp.189-234, 1994.

D. K. Roy and A. P. Pentland, Learning words from sights and sounds: a computational model, Cognitive Science, vol.26, issue.1, pp.113-146, 2002.

J. Su, D. V. Vargas, and K. Sakurai, One pixel attack for fooling deep neural networks, IEEE Transactions on Evolutionary Computation, pp.1-1, 2019.

A. Weber and O. Scharenborg, Models of spoken-word recognition, Wiley Interdisciplinary Reviews: Cognitive Science, vol.3, issue.3, pp.387-401, 2012.

Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre et al., Spoofing and countermeasures for speaker verification: a survey, 2014.

W. E. Zhang, Q. Z. Sheng, and A. A. F. Alhazmi, Generating textual adversarial examples for deep learning models: A survey, 2019.