S. Andrews, I. Tsochantaridis, and T. Hofmann, Support vector machines for multiple-instance learning, NIPS, 2003.

R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, NetVLAD: CNN architecture for weakly supervised place recognition, CVPR, 2016.

S. Avila, N. Thome, M. Cord, E. Valle, and A. Araujo, Pooling in image representation: The visual codeword point of view, Computer Vision and Image Understanding, vol.117, issue.5, 2012.
DOI : 10.1016/j.cviu.2012.09.007
URL : https://hal.archives-ouvertes.fr/hal-01172709

A. Behl, C. V. Jawahar, and M. P. Kumar, Optimizing average precision using weakly supervised data, CVPR, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00984699

H. Bilen, V. Namboodiri, and L. Van Gool, Object and Action Classification with Latent Window Parameters, IJCV, 2013.
DOI : 10.1007/s11263-013-0646-8

M. Blaschko, P. Kumar, and B. Taskar, Tutorial: Visual learning with weak supervision

K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, Return of the Devil in the Details: Delving Deep into Convolutional Nets, Proceedings of the British Machine Vision Conference 2014
DOI : 10.5244/C.28.6

L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected crfs, 2015.

T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence, vol.89, issue.1-2, 1997.
DOI : 10.1016/S0004-3702(96)00034-3

T. Durand, N. Thome, and M. Cord, MANTRA: Minimum Maximum Latent Structural SVM for Image Classification and Ranking, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.311
URL : https://hal.archives-ouvertes.fr/hal-01343784

T. Durand, N. Thome, M. Cord, and D. Picard, Incremental learning of latent structural SVM for weakly supervised image classification, 2014 IEEE International Conference on Image Processing (ICIP), 2014.
DOI : 10.1109/ICIP.2014.7025862
URL : https://hal.archives-ouvertes.fr/hal-01077058

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, The Pascal Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, vol.88, issue.2, 2010.
DOI : 10.1007/s11263-009-0275-4

P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.32, issue.9, 2010.
DOI : 10.1109/TPAMI.2009.167

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.81

H. Goh, N. Thome, M. Cord, and J. Lim, Learning Deep Hierarchical Visual Feature Coding, IEEE Transactions on Neural Networks and Learning Systems, 2014.
DOI : 10.1109/TNNLS.2014.2307532
URL : https://hal.archives-ouvertes.fr/hal-01185465

Y. Gong, L. Wang, R. Guo, and S. Lazebnik, Multi-scale Orderless Pooling of Deep Convolutional Activation Features, ECCV, 2014
DOI : 10.1007/978-3-319-10584-0_26

K. He, X. Zhang, S. Ren, and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, ECCV, 2014.

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long et al., Caffe, Proceedings of the ACM International Conference on Multimedia, MM '14, 2014.
DOI : 10.1145/2647868.2654889

M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman, Blocks That Shout: Distinctive Parts for Scene Classification, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.124

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, 2012.

P. Kumar, B. Packer, and D. Koller, Self-paced learning for latent variable models, NIPS, 2010.

K. Lai, F. X. Yu, M. Chen, and S. Chang, Video Event Detection by Inferring Temporal Instance Labels, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.288

M. T. Law, N. Thome, and M. Cord, Fantope Regularization in Metric Learning, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.138
URL : https://hal.archives-ouvertes.fr/hal-01094074

S. Lazebnik, C. Schmid, and J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 2 (CVPR'06), 2006.
DOI : 10.1109/CVPR.2006.68
URL : https://hal.archives-ouvertes.fr/inria-00548585

W. Li and N. Vasconcelos, Multiple instance learning for soft bags via top instances, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI : 10.1109/CVPR.2015.7299056

L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei, Object bank: A high-level image representation for scene classification & semantic feature sparsification, NIPS, 2010.

M. Lin, Q. Chen, and S. Yan, Network in network, ICLR, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01460127

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common Objects in Context, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_48

J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298965

P. Mohapatra, C. Jawahar, and M. P. Kumar, Efficient optimization for average precision SVM, NIPS, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01069917

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.222
URL : https://hal.archives-ouvertes.fr/hal-00911179

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Is object localization for free? - Weakly-supervised learning with convolutional neural networks, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298668
URL : https://hal.archives-ouvertes.fr/hal-01015140

M. Pandey and S. Lazebnik, Scene recognition and weakly supervised object localization with deformable part-based models, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126383

G. Papandreou, I. Kokkinos, and P. Savalle, Modeling local and global deformations in Deep Learning: Epitomic convolution, Multiple Instance Learning, and sliding window detection, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI : 10.1109/CVPR.2015.7298636
URL : https://hal.archives-ouvertes.fr/hal-01263611

S. N. Parizi, J. G. Oberlin, and P. F. Felzenszwalb, Reconfigurable models for scene recognition, 2012 IEEE Conference on Computer Vision and Pattern Recognition
DOI : 10.1109/CVPR.2012.6248001

S. N. Parizi, A. Vedaldi, A. Zisserman, and P. F. Felzenszwalb, Automatic discovery and optimization of parts for image classification, ICLR, 2015.

F. Perronnin and C. R. Dance, Fisher Kernels on Visual Vocabularies for Image Categorization, 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
DOI : 10.1109/CVPR.2007.383266

A. Quattoni and A. Torralba, Recognizing indoor scenes, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
DOI : 10.1109/CVPR.2009.5206537

O. Russakovsky, Y. Lin, K. Yu, and L. Fei-Fei, Object-centric spatial pooling for image classification, ECCV, 2012.

F. Sadeghi and M. F. Tappen, Latent Pyramidal Regions for Recognizing Scenes, ECCV, 2012
DOI : 10.1007/978-3-642-33715-4_17

T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, Robust object recognition with cortex-like mechanisms, PAMI, 2007.

G. Sharma, F. Jurie, and C. Schmid, Discriminative spatial saliency for image classification, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
DOI : 10.1109/CVPR.2012.6248093
URL : https://hal.archives-ouvertes.fr/hal-00714311

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR, 2015.

J. Sivic and A. Zisserman, Video Google: a text retrieval approach to object matching in videos, Proceedings Ninth IEEE International Conference on Computer Vision, 2003.
DOI : 10.1109/ICCV.2003.1238663

J. Sun and J. Ponce, Learning Discriminative Part Detectors for Image Classification and Cosegmentation, 2013 IEEE International Conference on Computer Vision
DOI : 10.1109/ICCV.2013.422
URL : https://hal.archives-ouvertes.fr/hal-00932380

C. Thériault, N. Thome, and M. Cord, Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.336

C. Thériault, N. Thome, and M. Cord, Extended Coding and Pooling in the HMAX Model, IEEE Transactions on Image Processing, vol.22, issue.2, 2013.
DOI : 10.1109/TIP.2012.2222900

C. Yu and T. Joachims, Learning structural SVMs with latent variables, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2009.
DOI : 10.1145/1553374.1553523

F. X. Yu, D. Liu, S. Kumar, T. Jebara, and S. Chang, ∝SVM for learning with label proportions, ICML, 2013.

Y. Yue, T. Finley, F. Radlinski, and T. Joachims, A support vector method for optimizing average precision, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '07, 2007.
DOI : 10.1145/1277741.1277790

N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, PANDA: Pose Aligned Networks for Deep Attribute Modeling, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.212

B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, Learning Deep Features for Scene Recognition using Places Database, NIPS, 2014.