D. A. Hudson and C. D. Manning, GQA: A new dataset for real-world visual reasoning and compositional question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.6700-6709, 2019.

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in visual question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.6904-6913, 2017.

R. Cadene, C. Dancette, H. Ben-younes, M. Cord, and D. Parikh, RUBi: Reducing unimodal biases for visual question answering, Advances in Neural Information Processing Systems, pp.839-850, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02507524

C. Clark, M. Yatskar, and L. Zettlemoyer, Don't take the easy way out: Ensemble based methods for avoiding known dataset biases, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.4060-4073, 2019.

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual question answering, Proceedings of the IEEE International Conference on Computer Vision, pp.2425-2433, 2015.

J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick et al., CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.2901-2910, 2017.

A. Agrawal, D. Batra, and D. Parikh, Analyzing the behavior of visual question answering models, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp.1955-1960, 2016.

A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra, Human attention in visual question answering: Do humans and deep networks look at the same regions?, Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach, Women also snowboard: Overcoming bias in captioning models, European Conference on Computer Vision, pp.793-811, 2018.

A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi, Don't just assume; look and answer: Overcoming priors for visual question answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

S. Ramakrishnan, A. Agrawal, and S. Lee, Overcoming language priors in visual question answering with adversarial regularization, Advances in Neural Information Processing Systems, pp.1541-1551, 2018.

J. Wu and R. Mooney, Self-critical reasoning for robust visual question answering, Advances in Neural Information Processing Systems, pp.8601-8611, 2019.

R. R. Selvaraju, S. Lee, Y. Shen, H. Jin, S. Ghosh et al., Taking a hint: Leveraging explanations to make vision and language models more grounded, Proceedings of the IEEE International Conference on Computer Vision, pp.2591-2600, 2019.

M. Malinowski and M. Fritz, A multi-world approach to question answering about real-world scenes based on uncertain input, Advances in neural information processing systems, pp.1682-1690, 2014.

D. A. Hudson and C. D. Manning, Learning by abstraction: The neural state machine, Advances in Neural Information Processing Systems, pp.5901-5914, 2019.

D. Bahdanau, H. de Vries, T. J. O'Donnell, S. Murty, P. Beaudoin et al., CLOSURE: Assessing systematic generalization of CLEVR models

D. Teney, E. Abbasnejad, and A. van den Hengel, Unshuffling data for improved generalization, 2020.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., Attention is all you need, Advances in neural information processing systems, pp.5998-6008, 2017.

J. Pennington, R. Socher, and C. D. Manning, GloVe: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.1532-1543, 2014.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural computation, vol.9, issue.8, pp.1735-1780, 1997.

P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson et al., Bottom-up and top-down attention for image captioning and visual question answering, Proceedings of the IEEE conference on computer vision and pattern recognition, pp.6077-6086, 2018.

Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian, Deep modular co-attention networks for visual question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.6281-6290, 2019.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.