E. Learned-Miller, G. B. Huang, A. Roychowdhury, H. Li, and G. Hua, Labeled faces in the wild: A survey, Advances in Face Detection and Facial Image Analysis, pp.189-248, 2016.

J. Zhou, Y. Cao, X. Wang, P. Li, and W. Xu, Deep recurrent models with fast-forward connections for neural machine translation, Transactions of the Association for Computational Linguistics (TACL), vol.4, pp.371-383, 2016.

G. Saon, T. Sercu, S. J. Rennie, and H. J. Kuo, The IBM 2016 English Conversational Telephone Speech Recognition System, Interspeech 2016, pp.520-527, 2016.
DOI : 10.21437/Interspeech.2016-1460
URL : http://arxiv.org/abs/1505.05899

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre et al., Mastering the game of Go with deep neural networks and tree search, Nature, vol.529, issue.7587, pp.484-489, 2016.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems (NIPS), pp.1097-1105, 2012.
DOI : 10.1162/neco.2009.10-08-881
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.299.205

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770-778, 2016.
DOI : 10.1109/CVPR.2016.90
URL : http://arxiv.org/abs/1512.03385

J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, How transferable are features in deep neural networks?, Advances in Neural Information Processing Systems (NIPS), pp.3320-3328, 2014.

K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, Return of the Devil in the Details: Delving Deep into Convolutional Nets, Proceedings of the British Machine Vision Conference 2014, 2014.
DOI : 10.5244/C.28.6

K. He, X. Zhang, S. Ren, and J. Sun, Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.37, issue.9
DOI : 10.1109/TPAMI.2015.2389824
URL : http://arxiv.org/abs/1406.4729

Y. Gong, L. Wang, R. Guo, and S. Lazebnik, Multi-scale Orderless Pooling of Deep Convolutional Activation Features, European Conference on Computer Vision (ECCV), pp.392-407, 2014.
DOI : 10.1007/978-3-319-10584-0_26
URL : http://arxiv.org/abs/1403.1840

R. Arandjelovic, P. Gronát, A. Torii, T. Pajdla, and J. Sivic, NetVLAD: CNN architecture for weakly supervised place recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.5297-5307, 2016.
DOI : 10.1109/cvpr.2016.572
URL : http://arxiv.org/abs/1511.07247

C. Xu, D. Tao, C. Xu, and Y. Rui, Large-margin weakly supervised dimensionality reduction, International Conference on Machine Learning, 2014.

T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence, vol.89, issue.1-2, pp.31-71, 1997.
DOI : 10.1016/S0004-3702(96)00034-3
URL : http://doi.org/10.1016/s0004-3702(96)00034-3

P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.32, issue.9, pp.1627-1645, 2010.
DOI : 10.1109/TPAMI.2009.167

D. P. Papadopoulos, A. D. Clarke, F. Keller, and V. Ferrari, Training Object Class Detectors from Eye Tracking Data, European Conference on Computer Vision (ECCV), pp.361-376, 2014.
DOI : 10.1007/978-3-319-10602-1_24
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.645.7140

Y. Luo, Y. Wen, D. Tao, J. Gui, and C. Xu, Large margin multi-modal multi-task feature extraction for image classification, IEEE Transactions on Image Processing, vol.25, issue.1, pp.414-427, 2016.

Y. Luo, D. Tao, C. Xu, C. Xu, H. Liu et al., Multiview vector-valued manifold regularization for multilabel image classification, IEEE Transactions on Neural Networks and Learning Systems, pp.709-722, 2013.

L. Li, H. Su, L. Fei-Fei, and E. P. Xing, Object bank: A high-level image representation for scene classification & semantic feature sparsification, Advances in Neural Information Processing Systems, pp.1378-1386, 2010.
DOI : 10.1007/s11263-013-0660-x

A. L. Yuille and A. Rangarajan, The concave-convex procedure (CCCP), p.1033, 2001.
DOI : 10.1162/08997660360581958

X. Wang, N. Thome, and M. Cord, Gaze latent support vector machine for image classification, 2016 IEEE International Conference on Image Processing (ICIP), pp.236-240, 2016.
DOI : 10.1109/ICIP.2016.7532354
URL : https://hal.archives-ouvertes.fr/hal-01342580

W. Li and N. Vasconcelos, Multiple instance learning for soft bags via top instances, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.4277-4285, 2015.
DOI : 10.1109/cvpr.2015.7299056

S. Ramanathan, V. Yanulevskaya, and N. Sebe, Can computers learn from humans to see better?: inferring scene semantics from viewers' eye movements, International Conference on Multimedia, pp.33-42, 2011.

K. Yun, Y. Peng, D. Samaras, G. J. Zelinsky, and T. L. Berg, Studying Relationships between Human Gaze, Description, and Computer Vision, 2013 IEEE Conference on Computer Vision and Pattern Recognition, p.739, 2013.
DOI : 10.1109/CVPR.2013.101
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.294.4727

S. Mathe and C. Sminchisescu, Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.37, issue.7
DOI : 10.1109/TPAMI.2014.2366154

G. Ge, K. Yun, D. Samaras, and G. J. Zelinsky, Action classification in still images using human eye movements, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp.2015-2031, 2015.
DOI : 10.1109/CVPRW.2015.7301288

S. Karthikeyan, V. Jagadeesh, R. Shenoy, M. Eckstein, and B. S. Manjunath, From Where and How to What We See, 2013 IEEE International Conference on Computer Vision, p.625, 2013.
DOI : 10.1109/ICCV.2013.83

J. Pan, E. Sayrol, X. Giró-i-Nieto, K. McGuinness, and N. E. O'Connor, Shallow and Deep Convolutional Networks for Saliency Prediction, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.598-606, 2016.
DOI : 10.1109/CVPR.2016.71
URL : http://arxiv.org/abs/1603.00845

S. S. Kruthiventi, V. Gudisa, J. H. Dholakiya, and R. V. Babu, Saliency Unified: A Deep Architecture for simultaneous Eye Fixation Prediction and Salient Object Segmentation, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.5781-5790, 2016.
DOI : 10.1109/CVPR.2016.623

T. Walber, A. Scherp, and S. Staab, Can you see it? Two novel eye-tracking-based measures for assigning tags to image regions, Advances in Multimedia Modeling, International Conference, pp.36-46, 2013.
DOI : 10.1007/978-3-642-35725-1_4
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.294.4905

S. Karthikeyan, T. Ngo, M. P. Eckstein, and B. S. Manjunath, Eye tracking assisted extraction of attentionally important objects from videos, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3241-3250, 2015.

N. Shapovalova, M. Raptis, L. Sigal, and G. Mori, Action is in the eye of the beholder: Eye-gaze driven model for spatio-temporal action localization.

D. Damen, T. Leelasawassuk, and W. Mayol-Cuevas, You-Do, I-Learn: Egocentric unsupervised discovery of objects and their modes of interaction towards video-based guidance, Computer Vision and Image Understanding, vol.149, pp.98-112, 2016.
DOI : 10.1016/j.cviu.2016.02.016

J. Xu, L. Mukherjee, Y. Li, J. Warner, J. M. Rehg et al., Gaze-enabled egocentric video summarization via constrained submodular maximization, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2235-2244, 2015.
DOI : 10.1109/cvpr.2015.7298836
URL : http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4784707

H. Su, J. Deng, and L. Fei-Fei, Crowdsourcing Annotations for Visual Object Detection, pp.1-6, 2012.

P. Kohli, L. Ladický, and P. H. Torr, Robust Higher Order Potentials for Enforcing Label Consistency, International Journal of Computer Vision, vol.82, issue.3, pp.302-324, 2009.
DOI : 10.1016/S0166-218X(01)00341-9
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.187.8646

S. Lopez, A. Revel, D. Lingrand, and F. Precioso, One gaze is worth ten thousand (key-)words, IEEE International Conference on Image Processing (ICIP), pp.3150-3154, 2015.
DOI : 10.1109/icip.2015.7351384

S. Mathe and C. Sminchisescu, Action from still image dataset and inverse optimal control to learn task specific visual scanpaths, Advances in Neural Information Processing Systems, pp.1923-1931, 2013.

S. O. Gilani, R. Subramanian, Y. Yan, D. Melcher, N. Sebe et al., PET: An eye-tracking dataset for animal-centric Pascal object classes, 2015 IEEE International Conference on Multimedia and Expo (ICME), pp.1-6, 2015.
DOI : 10.1109/ICME.2015.7177450
URL : http://arxiv.org/abs/1604.01574

X. Wang, D. Kumar, N. Thome, M. Cord, and F. Precioso, Recipe recognition with large multimodal food dataset.
URL : https://hal.archives-ouvertes.fr/hal-01196959

X. Wang, Z. Zhu, C. Yao, and X. Bai, Relaxed Multiple-Instance SVM with Application to Object Discovery, 2015 IEEE International Conference on Computer Vision (ICCV), pp.1224-1232, 2015.
DOI : 10.1109/ICCV.2015.145
URL : http://arxiv.org/abs/1510.01027

W. Shen, X. Bai, Z. Hu, and Z. Zhang, Multiple instance subspace learning via partial random projection tree for local reflection symmetry in natural images, Pattern Recognition, vol.52, pp.306-316, 2016.
DOI : 10.1016/j.patcog.2015.10.015

M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman, Blocks That Shout: Distinctive Parts for Scene Classification, 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp.923-930, 2013.
DOI : 10.1109/CVPR.2013.124
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.649.1623

J. Sun and J. Ponce, Learning Discriminative Part Detectors for Image Classification and Cosegmentation, 2013 IEEE International Conference on Computer Vision, pp.3400-3407, 2013.
DOI : 10.1109/ICCV.2013.422
URL : https://hal.archives-ouvertes.fr/hal-00932380

X. Wang, B. Wang, X. Bai, W. Liu, and Z. Tu, Max-margin multiple-instance dictionary learning, International Conference on Machine Learning, p.846, 2013.

A. Shrivastava, V. M. Patel, J. K. Pillai, and R. Chellappa, Generalized Dictionaries for Multiple Instance Learning, International Journal of Computer Vision, vol.114, issue.2-3, pp.288-305, 2015.
DOI : 10.1109/CVPR.2010.5539989

S. Andrews, I. Tsochantaridis, and T. Hofmann, Support vector machines for multiple-instance learning, Advances in Neural Information Processing Systems (NIPS), pp.561-568, 2002.

T. Durand, N. Thome, M. Cord, and D. Picard, Incremental learning of latent structural SVM for weakly supervised image classification, 2014 IEEE International Conference on Image Processing (ICIP), pp.4246-4250, 2014.
DOI : 10.1109/ICIP.2014.7025862
URL : https://hal.archives-ouvertes.fr/hal-01077058

H. Bilen, V. P. Namboodiri, and L. J. Van Gool, Object and Action Classification with Latent Window Parameters, International Journal of Computer Vision, vol.106, issue.3, pp.237-251, 2014.
DOI : 10.1109/CVPR.2010.5540096

H. Azizpour, M. Arefiyan, S. N. Parizi, and S. Carlsson, Spotlight the Negatives: A Generalized Discriminative Latent Model, Proceedings of the British Machine Vision Conference 2015, pp.1-11, 2015.
DOI : 10.5244/C.29.18
URL : http://arxiv.org/abs/1507.02144

T. Durand, N. Thome, and M. Cord, MANTRA: Minimum Maximum Latent Structural SVM for Image Classification and Ranking, 2015 IEEE International Conference on Computer Vision (ICCV), pp.2713-2721, 2015.
DOI : 10.1109/ICCV.2015.311
URL : https://hal.archives-ouvertes.fr/hal-01343784

T. Durand, N. Thome, and M. Cord, WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.4743-4752, 2016.
DOI : 10.1109/CVPR.2016.513
URL : https://hal.archives-ouvertes.fr/hal-01343785

S. Mathe and C. Sminchisescu, Multiple instance reinforcement learning for efficient weakly-supervised detection in images

S. Mathe, A. Pirinen, and C. Sminchisescu, Reinforcement learning for visual object detection, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2894-2902, 2016.
DOI : 10.1109/cvpr.2016.316

I. Shcherbatyi, A. Bulling, and M. Fritz, GazeDPM: Early integration of gaze information in deformable part models

V. Vapnik and R. Izmailov, Learning using privileged information: Similarity control and knowledge transfer, J. Mach. Learn. Res, vol.16, pp.2023-2049, 2015.
DOI : 10.1007/978-3-319-17091-6_1

S. You, C. Xu, Y. Wang, C. Xu, D. Tao et al., Privileged multi-label learning, International Joint Conference on Artificial Intelligence (IJCAI), 2017.

T. Joachims, T. Finley, and C. J. Yu, Cutting-plane training of structural SVMs, Machine Learning, vol.77, issue.1, pp.27-59, 2009.
DOI : 10.1007/s10994-009-5108-8
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.1367

H.-C. Wang and M. Pomplun, The attraction of visual attention to texts in real-world scenes, Journal of Vision, vol.12, pp.1-17, 2012.

Tobii AB, Tobii Studio User's Manual Version 3, 2016.

A. Olsen, The Tobii I-VT Fixation Filter, 2012.

L. Fei-Fei, A. Iyer, C. Koch, and P. Perona, What do we perceive in a glance of a real-world scene?, Journal of Vision, vol.7, pp.1-29, 2007.

M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The Pascal visual object classes challenge: A retrospective, International Journal of Computer Vision, vol.111, issue.1, pp.98-136, 2015.

A. Gordo, A. Gaidon, and F. Perronnin, Deep Fishing: Gradient Features from Deep Nets, Proceedings of the British Machine Vision Conference 2015, pp.1-12, 2015.
DOI : 10.5244/C.29.111
URL : http://arxiv.org/abs/1507.06429

M. D. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, European Conference on Computer Vision, pp.818-833, 2014.
DOI : 10.1007/978-3-319-10590-1_53
URL : http://arxiv.org/abs/1311.2901

Z. Song, Q. Chen, Z. Huang, Y. Hua, and S. Yan, Contextualizing object detection and classification, CVPR 2011, pp.1585-1592, 2011.
DOI : 10.1109/CVPR.2011.5995330

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, IEEE CVPR, pp.1717-1724, 2014.
DOI : 10.1109/cvpr.2014.222
URL : https://hal.archives-ouvertes.fr/hal-00911179

G. Gkioxari, R. Girshick, and J. Malik, Actions and attributes from wholes and parts, IEEE International Conference on Computer Vision (ICCV), pp.2470-2478, 2015.
DOI : 10.1109/iccv.2015.284
URL : http://arxiv.org/abs/1412.2604

M. Hoai, Regularized Max Pooling for Image Categorization, Proceedings of the British Machine Vision Conference 2014, pp.1-12, 2014.
DOI : 10.5244/C.28.32