G. Andrew, R. Arora, J. Bilmes, and K. Livescu, Deep canonical correlation analysis, ICML, 2013.

J. L. Ba, J. R. Kiros, and G. Hinton, Layer normalization, 2016.

G. Chechik, V. Sharma, U. Shalit, and S. Bengio, Large scale online learning of image similarity through ranking, JMLR, vol.11, pp.1109-1135, 2010.

J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, NIPS Workshop on Deep Learning, 2014.

J. Dai, Y. Li, K. He, and J. Sun, R-FCN: Object detection via region-based fully convolutional networks, NIPS, 2016.

T. Durand, T. Mordan, N. Thome, and M. Cord, WILDCAT: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation, CVPR, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01515640

T. Durand, N. Thome, and M. Cord, WELDON: Weakly supervised learning of deep convolutional neural networks, CVPR, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01343785

A. Eisenschtat and L. Wolf, Linking image and text with 2-way nets, 2016.

A. Eisenschtat and L. Wolf, Linking image and text with 2-way nets, CVPR, 2017.

F. Faghri, D. Fleet, J. R. Kiros, and S. Fidler, VSE++: Improved visual-semantic embeddings, 2017.

A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean et al., DeViSE: A deep visual-semantic embedding model, NIPS, 2013.

A. Frome, Y. Singer, F. Sha, and J. Malik, Learning globally-consistent local distance functions for shape-based image retrieval and classification, ICCV, 2007.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR, 2016.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural computation, vol.9, issue.8, pp.1735-1780, 1997.

E. Hoffer and N. Ailon, Deep metric learning using triplet network, ICLRw, 2015.

H. Hotelling, Relations between two sets of variates, Biometrika, 1936.

A. Karpathy and L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, CVPR, 2015.

Y. Kim, Convolutional neural networks for sentence classification, EMNLP, 2014.

D. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, 2014.

R. Kiros, R. Salakhutdinov, and R. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, 2014.

R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun et al., Skip-thought vectors, NIPS, 2015.

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, 2016.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, 2012.

P. L. Lai and C. Fyfe, Kernel and nonlinear canonical correlation analysis, Int. J. Neural Syst., 2000.

T. Lei and Y. Zhang, Training RNNs as fast as CNNs, 2017.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common objects in context, ECCV, 2014.

L. Ma, Z. Lu, L. Shang, and H. Li, Multimodal convolutional neural networks for matching image and sentence, ICCV, 2015.

A. Mignon and F. Jurie, PCCA: A new approach for distance learning from sparse pairwise constraints, CVPR, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00806007

T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, 2013.

H. Nam, J. Ha, and J. Kim, Dual attention networks for multimodal reasoning and matching, 2017.

Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua, Hierarchical multimodal LSTM for dense visual-semantic embedding, CVPR, 2017.

J. Pennington, R. Socher, and C. Manning, GloVe: Global vectors for word representation, EMNLP, 2014.

Z. Ren, H. Jin, Z. Lin, C. Fang, and A. Yuille, Joint image-text representation by Gaussian visual-semantic embedding, ACMMM, 2016.

A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli et al., Learning cross-modal embeddings for cooking recipes and food images, CVPR, 2017.

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh et al., Grad-CAM: Visual explanations from deep networks via gradient-based localization, 2017.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014.

S. Chopra, R. Hadsell, and Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, CVPR, 2005.

L. Wang, Y. Li, and S. Lazebnik, Learning deep structure-preserving image-text embeddings, CVPR, 2016.

L. Wang, Y. Li, and S. Lazebnik, Learning two-branch neural networks for image-text matching tasks, 2017.

K. Q. Weinberger and L. K. Saul, Distance metric learning for large margin nearest neighbor classification, JMLR, vol.10, pp.207-244, 2009.

J. Weston, S. Bengio, and N. Usunier, Wsabie: Scaling up to large vocabulary image annotation, IJCAI, 2011.

F. Xiao, L. Sigal, and Y. Lee, Weakly-supervised visual grounding of phrases with linguistic structures, CVPR, 2017.

E. P. Xing, M. I. Jordan, S. J. Russell, and A. Y. Ng, Distance metric learning, with application to clustering with side-information, NIPS, 2002.

F. Yan and K. Mikolajczyk, Deep correlation for matching images and text, CVPR, 2015.

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Learning deep features for discriminative localization, CVPR, 2016.