P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics, vol.2, pp.67-78, 2014.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common objects in context, Computer Vision-ECCV 2014, pp.740-755, 2014.

D. Harwath and J. Glass, Deep multimodal semantic embeddings for speech and images, IEEE Automatic Speech Recognition and Understanding Workshop, pp.237-244, 2015.

G. Chrupała, L. Gelderloos, and A. Alishahi, Representations of language in a model of visually grounded speech signal, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol.1, pp.613-622, 2017.

W. Havard, L. Besacier, and O. Rosec, SPEECH-COCO: 600k visually grounded spoken captions aligned to MSCOCO data set, Proc. GLU 2017 International Workshop on Grounding Language Understanding, pp.42-46, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01580879

Y. Yoshikawa, Y. Shigeto, and A. Takeuchi, STAIR captions: Constructing a large-scale Japanese image caption dataset, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol.2, pp.417-421, 2017.

A. Alishahi, M. Barking, and G. Chrupała, Encoding of phonology in a recurrent neural model of grounded speech, Proceedings of the 21st Conference on Computational Natural Language Learning, pp.368-378, 2017.

D. Harwath, G. Chuang, and J. R. Glass, Vision as an interlingua: Learning multilingual semantic embeddings of untranscribed speech, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.4969-4973, 2018.

D. Harwath, A. Recasens, D. Surís, G. Chuang, A. Torralba et al., Jointly discovering visual objects and spoken words from raw sensory input, Computer Vision-ECCV 2018, pp.659-677, 2018.

S. Gella, R. Sennrich, F. Keller, and M. Lapata, Image pivoting for learning multilingual multimodal representations, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp.2839-2845, 2017.

Á. Kádár, G. Chrupała, and A. Alishahi, Representation of linguistic form and function in recurrent neural networks, Computational Linguistics, vol.43, issue.4, pp.761-780, 2017.

D. Harwath and J. Glass, Learning word-like units from joint audio-visual analysis, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol.1, pp.506-517, 2017.

H. Kamper, S. Settle, G. Shakhnarovich, and K. Livescu, Visually grounded learning of keyword prediction from untranscribed speech, 2017.

D. F. Harwath, A. Torralba, and J. R. Glass, Unsupervised learning of spoken language with visual context, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pp.1858-1866, 2016.

E. Dupoux, Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner, Cognition, vol.173, pp.43-59, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01888694

E. J. Gibson, Principles of perceptual learning and development, The century psychology series, 1969.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, Proceedings of ICLR 2015, pp.1-14, 2015.

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, Montreal forced aligner: Trainable text-speech alignment using Kaldi, 2017.

T. Kisler, U. Reichel, and F. Schiel, Multilingual processing of speech via web services, Computer Speech & Language, vol.45, pp.326-347, 2017.

H. Schmid, Probabilistic part-of-speech tagging using decision trees, Studies in Computational Linguistics, pp.154-164, 1997.

G. Neubig, Y. Nakata, and S. Mori, Pointwise prediction for robust, adaptable Japanese morphological analysis, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.529-533, 2011.

S. Petrov, D. Das, and R. McDonald, A universal part-of-speech tagset, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012.

D. Gentner, Why nouns are learned before verbs: Linguistic relativity versus natural partitioning, Language, vol.2, pp.301-334, 1982.

E. Haryu and S. Kajikawa, Use of bound morphemes (noun particles) in word segmentation by Japanese-learning infants, Journal of Memory and Language, vol.88, issue.C, pp.18-27, 2016.

C. A. Ferguson and D. I. Slobin, Studies of child language development, 1973.