R. Aygun and W. Benesova, Multimedia retrieval that works, 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp.63-68, 2018.

A. F. Biten, L. Gomez, M. Rusinol, and D. Karatzas, Good news, everyone! context driven entity-aware captioning for news images, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

J. Devlin, M. Chang, K. Lee, and K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, 2018.

J. L. Elman, Finding structure in time, Cognitive science, vol.14, issue.2, pp.179-211, 1990.

F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, VSE++: Improving visual-semantic embeddings with hard negatives, 2018.

Z. Fan, Z. Wei, S. Wang, and X. Huang, Bridging by word: Image grounded vocabulary construction for visual captioning, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.6514-6524, 2019.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley et al., Generative adversarial nets, Advances in Neural Information Processing Systems, vol.27, pp.2672-2680, 2014.

D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, Canonical correlation analysis: An overview with application to learning methods, Neural computation, vol.16, issue.12, pp.2639-2664, 2004.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural computation, vol.9, issue.8, pp.1735-1780, 1997.

X. Huang and Y. Peng, Deep cross-media knowledge transfer, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.8837-8846, 2018.

Y. Huang, Y. Long, and L. Wang, Few-shot image and sentence matching via gated visual-semantic embedding, Proceedings of the AAAI Conference on Artificial Intelligence, vol.33, pp.8489-8496, 2019.

Y. Huang and L. Wang, ACMM: Aligned cross-modal memory for few-shot image and sentence matching, The IEEE International Conference on Computer Vision (ICCV), 2019.

Z. Ji, Y. Sun, Y. Yu, Y. Pang, and J. Han, Attribute-guided network for cross-modal zero-shot hashing, IEEE transactions on neural networks and learning systems, 2019.

Y. Jian, J. Xiao, Y. Cao, A. Khan, and J. Zhu, Deep pairwise ranking with multi-label information for cross-modal retrieval, 2019 IEEE International Conference on Multimedia and Expo (ICME), pp.1810-1815, 2019.

K. Lee, X. Chen, G. Hua, H. Hu, and X. He, Stacked cross attention for image-text matching, The European Conference on Computer Vision (ECCV), 2018.

S. Li, T. Xiao, H. Li, B. Zhou, D. Yue et al., Person search with natural language description, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

W. Li, P. Zhang, L. Zhang, Q. Huang, X. He et al., Object-driven text-to-image synthesis via adversarial training, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.12174-12182, 2019.

T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick et al., Microsoft COCO: Common objects in context, 2014.

C. Liu, Z. Mao, A. Liu, T. Zhang, B. Wang et al., Focus your attention: A bidirectional focal attention network for image-text matching, Proceedings of the 27th ACM International Conference on Multimedia, MM '19, pp.3-11, 2019.

C. Liu, Z. Mao, W. Zang, and B. Wang, A neighbor-aware approach for image-text matching, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.3970-3974, 2019.

F. Liu and R. Ye, A strong and robust baseline for text-image matching, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp.169-176, 2019.

J. Liu, C. Xu, and H. Lu, Cross-media retrieval: State-of-the-art and open issues, Int. J. of Multimedia Intelligence and Security, vol.1, pp.33-52, 2010.

J. Liu, Z. Zha, R. Hong, M. Wang, and Y. Zhang, Deep adversarial graph attention convolution network for text-based person search, Proceedings of the 27th ACM International Conference on Multimedia, MM '19, pp.665-673, 2019.

J. Luo, Y. Shen, X. Ao, Z. Zhao, and M. Yang, Cross-modal image-text retrieval with multitask learning, Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, pp.2309-2312, 2019.

L. Ma, W. Jiang, Z. Jie, Y. Jiang, and W. Liu, Matching image and sentence with multi-faceted representations, IEEE Transactions on Circuits and Systems for Video Technology, pp.1-1, 2019.

J. Marin, A. Biswas, F. Ofli, N. Hynes, A. Salvador et al., Recipe1M+: A dataset for learning cross-modal embeddings for cooking recipes and food images, IEEE Trans. Pattern Anal. Mach. Intell, 2019.

M. Mueller, A. Arzt, S. Balke, M. Dorfer, and G. Widmer, Cross-modal music retrieval and applications: An overview of key methodologies, IEEE Signal Processing Magazine, vol.36, issue.1, pp.52-62, 2019.

H. Nam, J. Ha, and J. Kim, Dual attention networks for multimodal reasoning and matching, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2156-2164, 2017.

Y. Peng, X. Huang, and Y. Zhao, An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges, IEEE Transactions on Circuits and Systems for Video Technology, vol.28, pp.2372-2385, 2018.

B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier et al., Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, 2015.

T. Qiao, J. Zhang, D. Xu, and D. Tao, MirrorGAN: Learning text-to-image generation by redescription, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1505-1514, 2019.

N. Rasiwasia, J. C. Pereira, E. Coviello, G. Doyle, G. R. Lanckriet et al., A new approach to cross-modal multimedia retrieval, Proceedings of the 18th ACM International Conference on Multimedia, MM '10, pp.251-260, 2010.

A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli et al., Learning cross-modal embeddings for cooking recipes and food images, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

N. Sarafianos, X. Xu, and I. A. Kakadiaris, Adversarial representation learning for text-to-image matching, Proceedings of the IEEE International Conference on Computer Vision, pp.5814-5824, 2019.

A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, Phoneme recognition using time-delay neural networks, IEEE transactions on acoustics, speech, and signal processing, vol.37, issue.3, pp.328-339, 1989.

B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen, Adversarial cross-modal retrieval, Proceedings of the 25th ACM International Conference on Multimedia, MM '17, pp.154-162, 2017.

H. Wang, D. Sahoo, C. Liu, E. Lim, and S. C. Hoi, Learning cross-modal embeddings with adversarial networks for cooking recipes and food images, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

L. Wang, Y. Li, J. Huang, and S. Lazebnik, Learning two-branch neural networks for image-text matching tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.41, issue.2, pp.394-407, 2019.

T. Wang, X. Xu, Y. Yang, A. Hanjalic, H. T. Shen et al., Matching images and text with multi-modal tensor fusion and re-ranking, Proceedings of the 27th ACM International Conference on Multimedia, MM '19, pp.12-20, 2019.

Y. Wang, H. Yang, X. Qian, L. Ma, J. Lu et al., Position focused attention network for image-text matching, Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI'19, pp.3792-3798, 2019.

Z. Wang, X. Liu, H. Li, L. Sheng, J. Yan et al., CAMP: Cross-modal adaptive message passing for text-image retrieval, The IEEE International Conference on Computer Vision (ICCV), 2019.

Y. Wu, S. Wang, G. Song, and Q. Huang, Learning fragment self-attention embeddings for image-text matching, Proceedings of the 27th ACM International Conference on Multimedia, MM '19, pp.2088-2096, 2019.

T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan et al., AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1316-1324, 2018.

Y. Zhang and H. Lu, Deep cross-modal projection learning for image-text matching, The European Conference on Computer Vision (ECCV), 2018.

L. Zhen, P. Hu, X. Wang, and D. Peng, Deep supervised cross-modal retrieval, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

B. Zhu, C. Ngo, J. Chen, and Y. Hao, R2GAN: Cross-modal recipe retrieval with generative adversarial network, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.