J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Neural module networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.39-48, 2016.

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual Question Answering, ICCV, 2015.

H. Ben-younes, R. Cadène, N. Thome, and M. Cord, MUTAN: Multimodal Tucker fusion for visual question answering, ICCV, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02073637

J. D. Carroll and J. J. Chang, Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition, Psychometrika, vol.35, issue.3, pp.283-319, 1970.

M. Carvalho, R. Cadène, D. Picard, L. Soulier, N. Thome et al., Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings, SIGIR, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01931470

A. Cichocki, D. P. Mandic, A. H. Phan, C. F. Caiafa, G. Zhou et al., Tensor decompositions for signal processing applications: From two-way to multi-way component analysis, 2015.

B. Dai, Y. Zhang, and D. Lin, Detecting visual relationships with deep relational networks, CVPR, 2017.

L. De Lathauwer, Decompositions of a higher-order tensor in block terms - Part II: Definitions and uniqueness, SIAM J. Matrix Anal. Appl, vol.30, issue.3, pp.1033-1066, 2008.

C. T. Duong, R. Lebret, and K. Aberer, Multimodal classification for analysing social media, 2017.

T. Durand, T. Mordan, N. Thome, and M. Cord, WILDCAT: weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation, CVPR, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01515640

M. Engilberge, L. Chevallier, P. Pérez, and M. Cord, Finding beans in burgers: Deep semantic-visual embedding with localization, CVPR, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02171857

A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell et al., Multimodal compact bilinear pooling for visual question answering and visual grounding, EMNLP, 2016.

Y. Goyal, T. Khot, D. Summers-stay, D. Batra, and D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in visual question answering, CVPR, 2017.

H. Zhang, Z. Kyaw, S. F. Chang, and T. S. Chua, Visual translation embedding network for visual relation detection, CVPR, 2017.

R. A. Harshman, Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis, UCLA Working Papers in Phonetics, vol.16, pp.1-84, 1970.

I. Ilievski and J. Feng, Multimodal learning and reasoning for visual question answering, Advances in Neural Information Processing Systems, pp.551-562, 2017.

Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra et al., Pythia v0.1: The winning entry to the VQA Challenge 2018, 2018.

K. Kafle and C. Kanan, An analysis of visual question answering algorithms, The IEEE International Conference on Computer Vision (ICCV), 2017.

J. Kim, K. W. On, W. Lim, J. Kim, J. Ha et al., Hadamard Product for Low-rank Bilinear Pooling, The 5th International Conference on Learning Representations, 2017.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, 2015.

R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba et al., Skip-thought vectors, NIPS, 2015.

R. Kiros, R. Salakhutdinov, and R. S. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, 2015.

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, International Journal of Computer Vision, vol.123, issue.1, pp.32-73, 2017.

Y. Li, W. Ouyang, X. Wang, and X. Tang, ViP-CNN: Visual phrase guided convolutional neural network, CVPR, 2017.

X. Liang, L. Lee, and E. P. Xing, Deep variation-structured reinforcement learning for visual relationship and attribute detection, CVPR, 2017.

J. Lu, X. Lin, D. Batra, and D. Parikh, Deeper LSTM and normalized CNN visual question answering model, 2015.

C. Lu, R. Krishna, M. Bernstein, and L. Fei-fei, Visual relationship detection with language priors, ECCV, 2016.

T. Mordan, N. Thome, G. Henaff, and M. Cord, Deformable part-based fully convolutional network for object detection, BMVC, 2017.

H. Noh and B. Han, Training recurrent answering units with joint loss minimization for VQA, 2016.

J. Peyre, I. Laptev, C. Schmid, and J. Sivic, Weakly-supervised learning of visual relations, ICCV, 2017.

D. Teney, P. Anderson, X. He, and A. van den Hengel, Tips and tricks for visual question answering: Learnings from the 2017 challenge, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

L. R. Tucker, Some mathematical notes on three-mode factor analysis, Psychometrika, vol.31, issue.3, pp.279-311, 1966.

R. Yu, A. Li, V. I. Morariu, and L. S. Davis, Visual relationship detection with internal and external linguistic knowledge distillation, ICCV, 2017.

Z. Yu, J. Yu, J. Fan, and D. Tao, Multi-modal factorized bilinear pooling with co-attention learning for visual question answering, 2017.

Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao, Beyond bilinear: Generalized multi-modal factorized high-order pooling for visual question answering, 2018.

Y. Zhang, J. Hare, and A. Bennett, Learning to count objects in natural images for visual question answering, ICLR, 2018.