S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual question answering, Proc. of ICCV, 2015.
DOI : 10.1109/iccv.2015.279

URL : http://arxiv.org/pdf/1505.00468

J. L. Ba, J. R. Kiros, and G. E. Hinton, Layer normalization, Deep Learning Symposium (NIPS), 2016.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, Proc. of ICLR, 2015.

J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, NIPS Deep Learning Workshop, 2014.

A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav et al., Visual dialog, Proc. of CVPR, 2017.
DOI : 10.1109/cvpr.2017.121

H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle et al., GuessWhat?! Visual object discovery through multi-modal dialogue, Proc. of CVPR, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01549641

J. B. Delbrouck and S. Dupont, Modulating and attending the source image during encoding improves multimodal translation, Visually-Grounded Interaction and Language Workshop (NIPS), 2017.

V. Dumoulin, J. Shlens, and M. Kudlur, A learned representation for artistic style, Proc. of ICLR, 2017.

V. Dumoulin, E. Perez, N. Schucher, F. Strub, H. de Vries et al., Feature-wise transformations, Distill, 2018.
DOI : 10.23915/distill.00011

URL : https://hal.archives-ouvertes.fr/hal-01841985

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes (VOC) challenge, International Journal of Computer Vision, vol.88, issue.2, pp.303-338, 2010.
DOI : 10.1007/s11263-009-0275-4

URL : http://www.dai.ed.ac.uk/homes/ckiw/postscript/ijcv_voc09.pdf

A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell et al., Multimodal compact bilinear pooling for visual question answering and visual grounding, Proc. of EMNLP, 2016.
DOI : 10.18653/v1/d16-1044

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proc. of CVPR, 2014.
DOI : 10.1109/cvpr.2014.81

URL : http://arxiv.org/pdf/1311.2524

A. Graves, G. Wayne, and I. Danihelka, Neural Turing Machines, arXiv preprint arXiv:1410.5401, 2014.

A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka et al., Hybrid computing using a neural network with dynamic external memory, Nature, vol.538, issue.7626, pp.471-476, 2016.
DOI : 10.1038/nature20101

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, Proc. of CVPR, 2016.
DOI : 10.1109/cvpr.2016.90

URL : http://arxiv.org/pdf/1512.03385

R. Hu, M. Rohrbach, and T. Darrell, Segmentation from natural language expressions, Proc. of ECCV, 2016.
DOI : 10.1007/978-3-319-46448-0_7

URL : http://arxiv.org/pdf/1603.06180

R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko et al., Natural language object retrieval, Proc. of CVPR, 2016.
DOI : 10.1109/cvpr.2016.493

URL : http://arxiv.org/pdf/1511.04164

D. A. Hudson and C. D. Manning, Compositional attention networks for machine reasoning, Proc. of ICLR, 2018.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proc. of ICML, 2015.

A. Jabri, A. Joulin, and L. van der Maaten, Revisiting visual question answering baselines, Proc. of ECCV, 2016.

J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick et al., CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning, Proc. of CVPR, 2017.

K. Kafle and C. Kanan, Visual question answering: Datasets, algorithms, and future challenges, Computer Vision and Image Understanding, vol.163, pp.3-20, 2017.

S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, ReferItGame: Referring to objects in photographs of natural scenes, Proc. of EMNLP, 2014.

J. H. Kim, K. W. On, W. Lim, J. Kim, J. W. Ha et al., Hadamard product for low-rank bilinear pooling, Proc. of ICLR, 2017.

J. H. Kim, S. W. Lee, D. Kwak, M. O. Heo, J. Kim et al., Multimodal residual learning for visual QA, Proc. of NIPS, 2016.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, Proc. of ICLR, 2015.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Proc. of NIPS, 2012.

S. W. Lee, Y. J. Heo, and B. T. Zhang, Answerer in questioner's mind for goal-oriented visual dialogue, Visually-Grounded Interaction and Language Workshop (NIPS), 2018.

T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common objects in context, Proc. of ECCV, 2014.

J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, Proc. of CVPR, 2015.

J. Lu, J. Yang, D. Batra, and D. Parikh, Hierarchical question-image co-attention for visual question answering, Proc. of NIPS, 2016.

R. Luo and G. Shakhnarovich, Comprehension-guided referring expressions, Proc. of CVPR, 2017.

M. T. Luong, H. Pham, and C. D. Manning, Effective approaches to attention-based neural machine translation, Proc. of EMNLP, 2015.

M. Malinowski, M. Rohrbach, and M. Fritz, Ask your neurons: A neural-based approach to answering questions about images, Proc. of ICCV, 2015.

H. Müller, P. Clough, T. Deselaers, and B. Caputo, ImageCLEF: Experimental evaluation in visual information retrieval, Springer, 2012.

V. K. Nagaraja, V. I. Morariu, and L. S. Davis, Modeling context between objects for referring expression understanding, Proc. of ECCV, 2016.

V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, Proc. of ICML, 2010.

E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, FiLM: Visual reasoning with a general conditioning layer, Proc. of AAAI, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01648685

A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, Grounding of textual phrases in images by reconstruction, Proc. of ECCV, 2016.

C. Rupprecht, I. Laina, N. Navab, G. D. Hager, and F. Tombari, Guide me: Interacting with deep networks, Proc. of CVPR, 2018.
DOI : 10.1109/cvpr.2018.00892

URL : http://arxiv.org/pdf/1803.11544

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet large scale visual recognition challenge, International Journal of Computer Vision, vol.115, issue.3, pp.211-252, 2015.
DOI : 10.1007/s11263-015-0816-y

URL : http://arxiv.org/pdf/1409.0575

F. Strub, H. de Vries, J. Mary, B. Piot, A. Courville et al., End-to-end optimization of goal-driven and visually grounded dialogue systems, Proc. of IJCAI, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01549642

S. Sukhbaatar, J. Weston, and R. Fergus, End-to-end memory networks, Proc. of NIPS, 2015.

H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin et al., Modulating early visual processing by language, Proc. of NIPS, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01648683

J. Weston, S. Chopra, and A. Bordes, Memory networks, arXiv preprint arXiv:1410.3916, 2014.

C. Xiong, S. Merity, and R. Socher, Dynamic memory networks for visual and textual question answering, Proc. of ICML, 2016.

H. Xu and K. Saenko, Ask, attend and answer: Exploring question-guided spatial attention for visual question answering, Proc. of ECCV, 2016.
DOI : 10.1007/978-3-319-46478-7_28

URL : http://arxiv.org/pdf/1511.05234

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville et al., Show, attend and tell: Neural image caption generation with visual attention, Proc. of ICML, 2015.

L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos, Efficient video object segmentation via network modulation, Proc. of CVPR, 2018.
DOI : 10.1109/cvpr.2018.00680

URL : http://arxiv.org/pdf/1802.01218

L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu et al., MAttNet: Modular attention network for referring expression comprehension, Proc. of CVPR, 2018.
DOI : 10.1109/cvpr.2018.00142

URL : http://arxiv.org/pdf/1801.08186

L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, Modeling context in referring expressions, Proc. of ECCV, 2016.
DOI : 10.1007/978-3-319-46475-6_5

URL : http://arxiv.org/pdf/1608.00272

L. Yu, H. Tan, M. Bansal, and T. L. Berg, A joint speaker-listener-reinforcer model for referring expressions, Proc. of CVPR, 2017.

Y. Zhu, S. Zhang, and D. Metaxas, Reasoning about fine-grained attribute phrases using reference games, Visually-Grounded Interaction and Language Workshop (NIPS), 2017.

B. Zhuang, Q. Wu, C. Shen, I. D. Reid, and A. van den Hengel, Parallel attention: A unified framework for visual object discovery through dialogs and queries, Proc. of CVPR, 2018.