E. Acar and B. Yener, Unsupervised Multiway Data Analysis: A Literature Survey, IEEE Transactions on Knowledge and Data Engineering, vol.21, issue.1, pp.6-20, 2009.

E. Acar, S. A. Çamtepe, M. S. Krishnamoorthy, and B. Yener, Modeling and Multiway Analysis of Chatroom Tensors, Proceedings of the 2005 IEEE International Conference on Intelligence and Security Informatics. ISI'05, pp.256-268, 2005.

M. Acharya, K. Kafle, and C. Kanan, TallyQA: Answering Complex Counting Questions, Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) (cit. on p. 97), 2019.

A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi, Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol.30, p.93, 2018.

C. M. Andersen and R. Bro, Practical aspects of PARAFAC modeling of fluorescence excitation-emission data, Journal of Chemometrics, vol.17, p.20, 2003.

P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson et al., Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (cit. on pp. 85, 91), 2018.

J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Learning to Compose Neural Networks for Question Answering, NAACL HLT 2016, pp.1545-1554, 2016.

J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Neural module networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.39-48, 2016.

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual Question Answering, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.

H. Azizpour, A. Sharif-razavian, J. Sullivan, A. Maki, and S. Carlsson, Factors of Transferability for a Generic ConvNet Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38.9, p.13, 2016.

B. W. Bader, R. A. Harshman, and T. G. Kolda, Temporal Analysis of Semantic Graphs Using ASALSAN, Seventh IEEE International Conference on Data Mining (ICDM 2007), pp.33-42, 2007.

D. Bahdanau, K. Cho, and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, Proceedings of the International Conference on Learning Representations (ICLR), p.22, 2015.

Y. Bai, J. Fu, T. Zhao, and T. Mei, Deep Attention Neural Tensor Network for Visual Question Answering, Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2018.

P. W. Battaglia et al., Relational inductive biases, deep learning, and graph networks, p.81, 2018.

H. Ben-younes, R. Cadène, N. Thome, and M. Cord, BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection, Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019.
URL : https://hal.archives-ouvertes.fr/hal-02073644

H. Ben-younes, R. Cadène, N. Thome et al., MUTAN: Multimodal Tucker Fusion for Visual Question Answering, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
URL : https://hal.archives-ouvertes.fr/hal-02073637

H. Ben-younes, R. Cadène, N. Thome et al., MUREL: Multimodal Relational Reasoning for Visual Question Answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
URL : https://hal.archives-ouvertes.fr/hal-02073649

Y. Bengio, P. Simard, and P. Frasconi, Learning Long-Term Dependencies with Gradient Descent is Difficult, IEEE Transactions on Neural Networks 5.2, p.16, 1994.

M. Bodén, A Guide to Recurrent Neural Networks and Backpropagation (cit. on p. 16), 2001.

L. Bottou, F. E. Curtis, and J. Nocedal, Optimization Methods for Large-Scale Machine Learning, SIAM Review, vol.60, 2016.

R. Bro, Review on Multiway Analysis in Chemistry, Critical Reviews in Analytical Chemistry, vol.36, p.20, 2000.

J. D. Carroll and J. J. Chang, Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition, Psychometrika (cit. on p. 60), 1970.

M. Charikar, K. Chen, and M. Farach-colton, Finding Frequent Items in Data Streams, International Colloquium on Automata, Languages and Programming, vol.684566, p.19, 2002.

C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma, Structured Attentions for Visual Question Answering, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

K. Cho, B. Van-merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), p.17, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01433235

J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Presented at the Deep Learning workshop at NIPS2014, p.17, 2014.

A. Cichocki, D. Mandic, L. De-lathauwer, G. Zhou, Q. Zhao et al., Tensor decompositions for signal processing applications: From two-way to multiway component analysis, IEEE Signal Processing Magazine, vol.32, pp.145-163, 2015.

P. Comon, Tensors: A brief introduction, IEEE Signal Processing Magazine, vol.31, p.20, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00923279

B. Dai, Y. Zhang, and D. Lin, Detecting Visual Relationships with Deep Relational Networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav et al., Visual Dialog, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (cit. on p. 99), 2017.

L. De-lathauwer, Decompositions of a Higher-Order Tensor in Block Terms - Part II: Definitions and Uniqueness, SIAM J. Matrix Anal. Appl., vol.30, issue.3, p.61, 2008.

J. Devlin, M. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, p.18, 2018.

J. L. Elman, Finding structure in time, Cognitive Science 14.2, p.16, 1990.

A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell et al., Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). The Association for Computational Linguistics (cit. on pp. 13, 46, 66), 2016.

K. Fukushima, Neocognitron: a Self Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position, Biological Cybernetics, vol.36, p.12, 1980.

T. G. Kolda, B. W. Bader, and J. P. Kenny, Higher-Order Web Link Analysis Using Multilinear Algebra, pp.242-249, 2005.

H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang et al., Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering, Advances in Neural Information Processing Systems (NIPS). NIPS'15, vol.13, p.11, 2015.

I. Goodfellow, J. Pouget-abadie, M. Mirza, B. Xu, D. Warde-farley et al., Generative adversarial nets, Advances in Neural Information Processing Systems (NIPS), pp.2672-2680, 2014.

G. Goovaerts, O. De Wel, B. Vandenberk, R. Willems, and S. Van Huffel, Detection of irregular heartbeats using tensors, 2015 Computing in Cardiology Conference (CinC), p.20, 2015.

Y. Goyal, T. Khot, D. Summers-stay, D. Batra, and D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (cit. on pp. 29, 63), 2017.

K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, LSTM: A Search Space Odyssey, p.16, 2015.

D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin et al., VizWiz Grand Challenge: Answering Visual Questions from Blind People, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (cit. on p. 97), 2018.

D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-taylor, Canonical Correlation Analysis: An Overview with Application to Learning Methods, Neural Comput. 16.12, pp.2639-2664, 2004.

R. A. Harshman, P. Ladefoged, H. Reichenbach, R. I. Jennrich, D. Terbeek et al., Foundations of the Parafac Procedure: Models and Conditions for an "explanatory" Multimodal Factor Analysis, 2001.

K. He, G. Gkioxari, P. Dollár, and R. Girshick, Mask R-CNN, Proceedings of the IEEE International Conference on Computer Vision (ICCV) (cit. on p. 28), 2017.

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (cit. on p. 13), 2016.

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Comput. 9.8, p.16, 1997.

H. Hotelling, Relations Between Two Sets of Variates, Biometrika 28.3-4, pp.321-377, 1936.

G. Hu, Y. Hua, Y. Yuan, Z. Zhang, Z. Lu, S. S. Mukherjee, T. M. Hospedales, N. M. Robertson, and Y. Yang, Attribute-Enhanced Face Recognition With Neural Tensor Fusion Networks, Proceedings of the IEEE International Conference on Computer Vision (ICCV) (cit. on p. 22), 2017.

R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, Learning to Reason: End-to-End Module Networks for Visual Question Answering, Proceedings of the IEEE International Conference on Computer Vision (ICCV) (cit. on p. 27), 2017.

A. Jabri, A. Joulin, and L. Van-der-maaten, Revisiting Visual Question Answering Baselines, Computer Vision - ECCV 2016, p.19, 2016.

Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra et al., Pythia v0.1: The Winning Entry to the VQA Challenge, vol.85, p.91, 2018.

J. Johnson, B. Hariharan, L. Van-der-maaten, L. Fei-fei, C. L. Zitnick et al., CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1988-1997, 2017.

J. Johnson, B. Hariharan, L. Van-der-maaten, J. Hoffman, L. Fei-fei et al., Inferring and Executing Programs for Visual Reasoning, Proceedings of the IEEE International Conference on Computer Vision (ICCV) (cit. on p. 27), 2017.

K. Kafle and C. Kanan, An Analysis of Visual Question Answering Algorithms, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

K. Kafle, B. Price, S. Cohen, and C. Kanan, DVQA: Understanding Data Visualizations via Question Answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (cit. on p. 97), 2018.

A. Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks, vol.17, p.16, 2015.

J. Kim, J. Jun, and B. Zhang, Bilinear Attention Networks, Advances in Neural Information Processing Systems (NIPS). Ed. by S. Bengio, H. Wallach et al., pp.1571-1581, 2018.

J. Kim, S. Lee, D. Kwak, M. Heo, J. Kim et al., Multimodal Residual Learning for Visual QA, Advances in Neural Information Processing Systems (NIPS), pp.361-369, 2016.

J. Kim, K. On, W. Lim, J. Kim, J. Ha et al., Hadamard Product for Low-rank Bilinear Pooling, Proceedings of the International Conference on Learning Representations (ICLR) (cit. on pp. 13, 46, 92), 2017.

D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, vol.46, 2014.

R. Kiros, R. Salakhutdinov, and R. Zemel, Multimodal Neural Language Models, Proceedings of Machine Learning Research. Beijing, pp.595-603, 2014.

R. Kiros, R. Salakhutdinov, and R. S. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, 2014.

R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba et al., Skip-thought Vectors, Advances in Neural Information Processing Systems (NIPS), pp.3294-3302, 2015.

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Advances in Neural Information Processing Systems (NIPS), vol.45, p.12, 2012.

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, International Journal of Computer Vision, pp.32-73, 2017.

V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, Speeding up convolutional neural networks using fine-tuned CP-decomposition, p.21, 2014.

Y. Lecun, B. Boser, J. Denker, D. Henderson, R. Howard et al., Backpropagation Applied to Handwritten Zip Code Recognition, Neural computation 1.4, p.12, 1989.

Y. Li, W. Ouyang, X. Wang, and X. Tang, ViP-CNN: Visual Phrase Guided Convolutional Neural Network, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (cit. on pp. 67, 72), 2017.

X. Liang, L. Lee, and E. P. Xing, Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (cit. on p. 72), 2017.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common Objects in Context, Proceedings of the IEEE European Conference on Computer Vision (ECCV), p.28, 2014.

T. Lin, A. Roychowdhury, and S. Maji, Bilinear CNN Models for Fine-grained Visual Recognition, Proceedings of the IEEE International Conference on Computer Vision (ICCV) (cit. on p. 59), 2015.

J. Long, E. Shelhamer, and T. Darrell, Fully Convolutional Networks for Semantic Segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.13, 2015.

C. Lu, R. Krishna, M. Bernstein, and L. Fei-fei, Visual Relationship Detection with Language Priors, Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2016.

J. Lu, J. Yang, D. Batra, and D. Parikh, Hierarchical Question-Image Co-Attention for Visual Question Answering, Advances in Neural Information Processing Systems (NIPS), pp.289-297, 2016.

L. Ma, Z. Lu, and H. Li, Learning to Answer Questions from Image Using Convolutional Neural Network, Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). AAAI'16, vol.13, p.11, 2016.

M. Malinowski, C. Doersch, A. Santoro, and P. Battaglia, Learning Visual Question Answering by Bootstrapping Hard Attention, Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2018.

M. Malinowski and M. Fritz, A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input, Advances in Neural Information Processing Systems (NIPS). Ed. by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence et al., pp.1682-1690, 2014.

M. Malinowski and M. Fritz, Towards a Visual Turing Challenge, Learning Semantics (cit. on p. 4), 2014.

M. Malinowski, M. Rohrbach, and M. Fritz, Ask Your Neurons: A Deep Learning Approach to Visual Question Answering, 2016.

D. Mascharka, P. Tran, R. Soklaski, and A. Majumdar, Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (cit. on p. 27), 2018.

T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, Recurrent neural network based language model, p.16, 2010.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems (NIPS), vol.18, p.15, 2013.

F. Miwakeichi, E. Martínez-montes, P. A. Valdés-sosa, N. Nishiyama, H. Mizuhara et al., Decomposing EEG data into space-time-frequency components using Parallel Factor Analysis, p.20, 2004.

M. Mørup, Applications of tensor (multiway array) factorizations and decompositions in data mining, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.1, pp.24-40, 2011.

H. Nam, J. Ha, and J. Kim, Dual Attention Networks for Multimodal Reasoning and Matching, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.25, 2017.

H. Noh and B. Han, Training recurrent answering units with joint loss minimization for vqa, vol.93, p.66, 2016.

W. Norcliffe-brown, E. Vafeias, and S. Parisot, Learning Conditioned Graph Structures for Interpretable Visual Question Answering, 2018.

C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert et al., The Building Blocks of Interpretability, p.24, 2018.

C. Olah, Understanding LSTM Networks, p.16, 2015.

R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural networks, JMLR Proceedings. JMLR.org, vol.28, p.16, 2013.

J. Pennington, R. Socher, and C. Manning, Glove: Global vectors for word representation, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), p.15, 2014.

E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, FiLM: Visual Reasoning with a General Conditioning Layer, Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) (cit. on pp. 28, 84), 2018.
URL : https://hal.archives-ouvertes.fr/hal-01648685

F. Perronnin, J. Sanchez, and T. Mensink, Improving the Fisher Kernel for Large-Scale Image Classification, Proceedings of the IEEE European Conference on Computer Vision (ECCV), 2010.
URL : https://hal.archives-ouvertes.fr/inria-00548630

M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark et al., Deep contextualized word representations, 2018.

J. Peyre, I. Laptev, C. Schmid, and J. Sivic, Weakly-supervised learning of visual relations, Proceedings of the IEEE International Conference on Computer Vision (ICCV) (cit. on p. 72), 2017.
URL : https://hal.archives-ouvertes.fr/hal-01576035

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei et al., Language Models are Unsupervised Multitask Learners, 2019.

A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN Features off-the-shelf: an Astounding Baseline for Recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop (cit. on p. 13), 2014.

S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele et al., Generative Adversarial Text to Image Synthesis, Proceedings of Machine Learning Research, vol.48, pp.1060-1069, 2016.

M. Ren, R. Kiros, and R. S. Zemel, Exploring Models and Data for Image Question Answering, Advances in Neural Information Processing Systems (NIPS), pp.2953-2961, 2015.

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Advances in Neural Information Processing Systems (NIPS), p.14, 2015.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision, pp.211-252, 2015.

S. Shah, A. Mishra, N. Yadati, and P. P. Talukdar, KVQA: Knowledge-Aware Visual Question Answering, Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) (cit. on pp. 97, 99), 2019.

A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu et al., A simple neural network module for relational reasoning, Advances in Neural Information Processing Systems (NIPS), p.26, 2017.

Y. Shi and T. Furlanello, Question Type Guided Attention in Visual Question Answering, Proceedings of the IEEE European Conference on Computer Vision (ECCV) (cit. on pp. 91, 93), 2018.

K. J. Shih, S. Singh, and D. Hoiem, Where To Look: Focus Regions for Visual Question Answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, Grounded Compositional Semantics for Finding and Describing Images with Sentences, Transactions of the Association for Computational Linguistics (TACL), vol.2, pp.207-218, 2014.

J. Sun, H. Zeng, H. Liu, Y. Lu, and Z. Chen, CubeSVD: A Novel Approach to Personalized Web Search, Proceedings of the 14th International Conference on World Wide Web. WWW '05, p.21, 2005.

D. Teney and P. Anderson, Tips and Tricks for Visual Question Answering: Learnings From the 2017 Challenge, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

L. R. Tucker, Some mathematical notes on three-mode factor analysis, Psychometrika 31.3, vol.36, p.30, 1966.

M. A. O. Vasilescu and D. Terzopoulos, Multilinear Analysis of Image Ensembles: TensorFaces, Proceedings of the IEEE European Conference on Computer Vision (ECCV), p.20, 2002.

H. Wang and N. Ahuja, A Tensor Approximation Approach to Dimensionality Reduction, International Journal of Computer Vision, p.20, 2008.

Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel, Ask me anything: free-form visual question answering based on knowledge from external sources, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

C. Xiong, S. Merity, and R. Socher, Dynamic Memory Networks for Visual and Textual Question Answering, JMLR.org, p.13, 2016.

H. Xu and K. Saenko, Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, Proceedings of the IEEE European Conference on Computer Vision (ECCV), pp.451-466, 2016.

K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML'15, p.22, 2015.

F. Yan and K. Mikolajczyk, Deep Correlation for Matching Images and Text, 2015.

S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang et al., Multilinear Discriminant Analysis for Face Recognition, IEEE Transactions on Image Processing, p.20, 2007.

Y. Yang and T. M. Hospedales, Deep Multi-task Representation Learning: A Tensor Factorisation Approach, Proceedings of the International Conference on Learning Representations (ICLR) (cit. on p. 21), 2017.

Y. Yang and T. M. Hospedales, Unifying Multi-domain Multitask Learning: Tensor and Neural Network Perspectives, Domain Adaptation in Computer Vision Applications, p.22, 2017.

Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola, Stacked Attention Networks for Image Question Answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.21-29, 2016.

J. Ye, L. Wang, G. Li, D. Chen, S. Zhe et al., Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli et al., Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding, Advances in Neural Information Processing Systems (NIPS), p.28, 2018.

R. Yu, S. Zheng, A. Anandkumar, and Y. Yue, Long-term forecasting using tensor-train RNNs, p.22, 2017.

R. Yu, A. Li, V. I. Morariu, and L. S. Davis, Visual Relationship Detection With Internal and External Linguistic Knowledge Distillation, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

Z. Yu, J. Yu, J. Fan, and D. Tao, Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering, Proceedings of the IEEE International Conference on Computer Vision (ICCV) (cit. on pp. 13), 2017.

Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao, Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering, IEEE Transactions on Neural Networks and Learning Systems, 2018.

H. Zhang, Z. Kyaw, S. Chang, and T. Chua, Visual Translation Embedding Network for Visual Relation Detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (cit. on pp. 67, 72), 2017.

H. Zhang, Z. Kyaw, J. Yu, and S. Chang, PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

Y. Zhang, J. Hare, and A. Prügel-Bennett, Learning to Count Objects in Natural Images for Visual Question Answering, Proceedings of the International Conference on Learning Representations (ICLR), 2018.

B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus, Simple baseline for visual question answering, p.16, 2015.