P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics, vol.2, pp.67-78, 2014.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common Objects in Context, European Conference on Computer Vision (ECCV)
DOI : 10.1007/978-3-319-10602-1_48

URL : http://arxiv.org/abs/1405.0312

J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, Attention-Based Models for Speech Recognition, Advances in Neural Information Processing Systems (NIPS 2015), pp.577-585, 2015.

A. Karpathy and F. Li, Deep visual-semantic alignments for generating image descriptions, IEEE Conference on Computer Vision and Pattern Recognition, pp.3128-3137, 2015.
DOI : 10.1109/tpami.2016.2598339

URL : http://arxiv.org/abs/1412.2306

L. Wang, Y. Li, and S. Lazebnik, Learning deep structure-preserving image-text embeddings
DOI : 10.1109/cvpr.2016.541

URL : http://arxiv.org/abs/1511.06078

D. Roy, Grounded spoken language acquisition: experiments in word learning, IEEE Transactions on Multimedia, vol.5, issue.2, pp.197-209, 2003.
DOI : 10.1109/TMM.2003.811618

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.141.5575

R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem et al., Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures (Extended Abstract), Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pp.409-442, 2016.
DOI : 10.24963/ijcai.2017/704

Q. Wu, D. Teney, P. Wang, C. Shen, A. R. Dick et al., Visual question answering: A survey of methods and datasets, Computer Vision and Image Understanding
DOI : 10.1016/j.cviu.2017.05.001

D. Harwath and J. Glass, Deep multimodal semantic embeddings for speech and images, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp.237-244, 2015.
DOI : 10.1109/ASRU.2015.7404800

URL : http://arxiv.org/abs/1511.03690

D. F. Harwath and J. R. Glass, Learning Word-Like Units from Joint Audio-Visual Analysis, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
DOI : 10.18653/v1/P17-1047

H. Kamper, S. Settle, G. Shakhnarovich, and K. Livescu, Visually Grounded Learning of Keyword Prediction from Untranscribed Speech, Interspeech 2017
DOI : 10.21437/Interspeech.2017-502

M. Hodosh, P. Young, and J. Hockenmaier, Framing image description as a ranking task: Data, models and evaluation metrics, Journal of Artificial Intelligence Research, vol.47, issue.1, pp.853-899, 2013.

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata et al., Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, International Journal of Computer Vision, vol.2, issue.1-2, 2016.
DOI : 10.1007/s11263-016-0981-7

URL : http://doi.org/10.1007/s11263-016-0981-7

G. Chrupała, L. Gelderloos, and A. Alishahi, Representations of language in a model of visually grounded speech signal, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017.
DOI : 10.18653/v1/P17-1057

D. Schwarz, Corpus-Based Concatenative Synthesis, IEEE Signal Processing Magazine, vol.24, issue.2, pp.92-104, 2007.
DOI : 10.1109/MSP.2007.323274

URL : https://hal.archives-ouvertes.fr/hal-01161253

H. Bortfeld, S. Leon, J. Bloom, M. Schober, and S. Brennan, Disfluency Rates in Conversation: Effects of Age, Relationship, Topic, Role, and Gender, Language and Speech, vol.35, issue.2, pp.123-147, 2001.
DOI : 10.1044/jshr.3504.782

A. Jansen and B. Van Durme, Efficient spoken term discovery using randomized algorithms, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, 2011.
DOI : 10.1109/ASRU.2011.6163965

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.232.4390

B. Ludusan, M. Versteegh, A. Jansen, G. Gravier, X. Cao et al., Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems, 2014.

T. Miyazaki and N. Shimizu, Cross-lingual image caption generation, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.1780-1790, 2016.

X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta et al., Microsoft COCO captions: Data collection and evaluation server, 2015.