From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics, vol.2, pp.67-78, 2014. ,
Microsoft COCO: Common Objects in Context, European Conference on Computer Vision (ECCV) ,
DOI : 10.1007/978-3-319-10602-1_48
URL : http://arxiv.org/abs/1405.0312
Attention-Based Models for Speech Recognition, Advances in Neural Information Processing Systems (NIPS 2015), pp.577-585, 2015. ,
Deep visual-semantic alignments for generating image descriptions, IEEE Conference on Computer Vision and Pattern Recognition, pp.3128-3137, 2015. ,
DOI : 10.1109/tpami.2016.2598339
URL : http://arxiv.org/abs/1412.2306
Learning deep structure-preserving image-text embeddings ,
DOI : 10.1109/cvpr.2016.541
URL : http://arxiv.org/abs/1511.06078
Grounded spoken language acquisition: experiments in word learning, IEEE Transactions on Multimedia, vol.5, issue.2, pp.197-209, 2003. ,
DOI : 10.1109/TMM.2003.811618
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.141.5575
Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures (Extended Abstract), Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pp.409-442, 2016. ,
DOI : 10.24963/ijcai.2017/704
Visual question answering: A survey of methods and datasets, Computer Vision and Image Understanding ,
DOI : 10.1016/j.cviu.2017.05.001
Deep multimodal semantic embeddings for speech and images, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp.237-244, 2015. ,
DOI : 10.1109/ASRU.2015.7404800
URL : http://arxiv.org/abs/1511.03690
Learning Word-Like Units from Joint Audio-Visual Analysis, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) ,
DOI : 10.18653/v1/P17-1047
Visually Grounded Learning of Keyword Prediction from Untranscribed Speech, Interspeech 2017 ,
DOI : 10.21437/Interspeech.2017-502
Framing image description as a ranking task: Data, models and evaluation metrics, Journal of Artificial Intelligence Research, vol.47, issue.1, pp.853-899, 2013. ,
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, International Journal of Computer Vision, vol.2, issue.1-2, 2016. ,
DOI : 10.1109/CVPR.2013.387
URL : http://doi.org/10.1007/s11263-016-0981-7
Representations of language in a model of visually grounded speech signal, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. ,
DOI : 10.18653/v1/P17-1057
Corpus-Based Concatenative Synthesis, IEEE Signal Processing Magazine, vol.24, issue.2, pp.92-104, 2007. ,
DOI : 10.1109/MSP.2007.323274
URL : https://hal.archives-ouvertes.fr/hal-01161253
Disfluency Rates in Conversation: Effects of Age, Relationship, Topic, Role, and Gender, Language and Speech, vol.35, issue.2, pp.123-147, 2001. ,
DOI : 10.1044/jshr.3504.782
Efficient spoken term discovery using randomized algorithms, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, 2011. ,
DOI : 10.1109/ASRU.2011.6163965
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.232.4390
Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems, 2014. ,
Cross-lingual image caption generation, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.1780-1790, 2016. ,
Microsoft COCO captions: Data collection and evaluation server, 2015. ,