H. Arora, N. Loeff, D. Forsyth, and N. Ahuja, Unsupervised Segmentation of Objects using Efficient Learning, 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
DOI : 10.1109/CVPR.2007.383011

A. Bergamo, L. Bazzani, D. Anguelov, and L. Torresani, Self-taught object localization with deep networks, CoRR, abs/1409, 2014.

M. Blaschko, A. Vedaldi, and A. Zisserman, Simultaneous object detection and ranking with weak supervision, NIPS, 2010.

T. Brox, L. Bourdev, S. Maji, and J. Malik, Object segmentation by alignment of poselet activations to image contours, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995659

K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, Return of the Devil in the Details: Delving Deep into Convolutional Nets, Proceedings of the British Machine Vision Conference 2014, 2014.
DOI : 10.5244/C.28.6

X. Chen, A. Shrivastava, and A. Gupta, NEIL: Extracting Visual Knowledge from Web Data, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.178

O. Chum and A. Zisserman, An Exemplar Model for Learning Object Classes, 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
DOI : 10.1109/CVPR.2007.383050

R. G. Cinbis, J. Verbeek, and C. Schmid, Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, issue.1, 2017.
DOI : 10.1109/TPAMI.2016.2535231
URL : https://hal.archives-ouvertes.fr/hal-01123482

R. Collobert, K. Kavukcuoglu, and C. Farabet, Torch7: A matlab-like environment for machine learning, BigLearn, NIPS Workshop, 2011.

D. Crandall and D. Huttenlocher, Weakly Supervised Learning of Part-Based Spatial Models for Visual Object Recognition, ECCV, 2006.
DOI : 10.1007/11744023_2

G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, Visual categorization with bags of keypoints, ECCV Workshop, 2004.

T. Deselaers, B. Alexe, and V. Ferrari, Localizing Objects While Learning Their Appearance, ECCV, 2010.
DOI : 10.1007/978-3-642-15561-1_33
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.308.8826

S. Divvala, A. Farhadi, and C. Guestrin, Learning Everything about Anything: Webly-Supervised Visual Concept Learning, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.412
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.672.5408

C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros, What makes Paris look like Paris?, ACM Transactions on Graphics, vol.31, issue.4, p.101, 2012.
DOI : 10.1145/2185520.2185597
URL : https://hal.archives-ouvertes.fr/hal-01053876

J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang et al., Decaf: A deep convolutional activation feature for generic visual recognition, 2013.

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, The PASCAL visual object classes (VOC) challenge, International Journal of Computer Vision, vol.88, issue.2, pp.303-338, 2010.

P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.32, issue.9, pp.1627-1645, 2010.
DOI : 10.1109/TPAMI.2009.167

P. Felzenszwalb, D. McAllester, and D. Ramanan, A discriminatively trained, multiscale, deformable part model, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587597

R. Fergus, P. Perona, and A. Zisserman, Object class recognition by unsupervised scale-invariant learning, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., 2003.
DOI : 10.1109/CVPR.2003.1211479

J. Foulds and E. Frank, A review of multi-instance learning assumptions, The Knowledge Engineering Review, vol.25, issue.01, pp.1-25, 2010.
DOI : 10.1017/S026988890999035X

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.81

M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459266
URL : https://hal.archives-ouvertes.fr/inria-00439276

B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, Simultaneous Detection and Segmentation, ECCV, 2014.
DOI : 10.1007/978-3-319-10584-0_20

H. Harzallah, F. Jurie, and C. Schmid, Combining efficient object localization and image classification, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459257
URL : https://hal.archives-ouvertes.fr/inria-00439516

M. Hejrati and D. Ramanan, Analyzing 3D objects in cluttered images, NIPS, 2012.

M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman, Blocks That Shout: Distinctive Parts for Scene Classification, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.124

J. D. Keeler, D. E. Rumelhart, and W. K. Leow, Integrated segmentation and recognition of hand-printed numerals, NIPS, 1991.

D. Kotzias, M. Denil, P. Blunsom, and N. de Freitas, Deep multi-instance transfer learning, CoRR, abs/1411.3128, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, 2012.

K. J. Lang and G. E. Hinton, A time delay neural network architecture for speech recognition, 1988.

K. J. Lang, A. H. Waibel, and G. E. Hinton, A time-delay neural network architecture for isolated word recognition, Neural Networks, vol.3, issue.1, pp.23-43, 1990.
DOI : 10.1016/0893-6080(90)90044-L

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard et al., Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, vol.1, issue.4, pp.541-551, 1989.
DOI : 10.1162/neco.1989.1.4.541

Y. J. Lee and K. Grauman, Learning the easy things first: Self-paced visual category discovery, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995523

M. Lin, Q. Chen, and S. Yan, Network in network, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01460127

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common Objects in Context, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_48

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.222
URL : https://hal.archives-ouvertes.fr/hal-00911179

V. Ordonez, G. Kulkarni, and T. Berg, Im2text: Describing images using 1 million captioned photographs, NIPS, 2011.

M. Pandey and S. Lazebnik, Scene recognition and weakly supervised object localization with deformable part-based models, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126383

G. Papandreou, I. Kokkinos, and P. Savalle, Untangling Local and Global Deformations in Deep Convolutional Networks for Image Classification and Sliding Window Detection, CVPR, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01109289

F. Perronnin, J. Sánchez, and T. Mensink, Improving the Fisher Kernel for Large-Scale Image Classification, ECCV, 2010.
DOI : 10.1007/978-3-642-15561-1_11
URL : https://hal.archives-ouvertes.fr/inria-00548630

A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, Learning object class detectors from weakly annotated video, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
DOI : 10.1109/CVPR.2012.6248065
URL : https://hal.archives-ouvertes.fr/hal-00695940

A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, arXiv preprint, 2014.

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus et al., OverFeat: Integrated recognition, localization and detection using convolutional networks, ICLR, 2014.

A. Shrivastava and A. Gupta, Building Part-Based Object Detectors via 3D Geometry, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.219
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.644.5026

K. Simonyan, A. Vedaldi, and A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, CoRR, abs/1312, 2013.

S. Singh, A. Gupta, and A. A. Efros, Unsupervised Discovery of Mid-Level Discriminative Patches, ECCV, 2012.
DOI : 10.1007/978-3-642-33709-3_6

J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, Discovering object categories in image collections, ICCV, 2005.

J. Sivic and A. Zisserman, Video Google: a text retrieval approach to object matching in videos, Proceedings Ninth IEEE International Conference on Computer Vision, 2003.
DOI : 10.1109/ICCV.2003.1238663

H. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui et al., On learning to localize objects with minimal supervision, ICML, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00996849

Z. Song, Q. Chen, Z. Huang, Y. Hua, and S. Yan, Contextualizing object detection and classification, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995330
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.660.6015

A. Toshev and C. Szegedy, DeepPose: Human Pose Estimation via Deep Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.214

K. van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders, Segmentation as selective search for object recognition, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126456

P. Viola, J. Platt, and C. Zhang, Multiple instance boosting for object detection, NIPS, 2005.

C. Wang, W. Ren, K. Huang, and T. Tan, Weakly Supervised Object Localization with Latent Category Learning, ECCV, 2014.
DOI : 10.1007/978-3-319-10599-4_28

Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong et al., CNN: Single-label to multi-label, p.6, 2014.

J. Winn and N. Jojic, LOCUS: learning object classes with unsupervised segmentation, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, 2005.
DOI : 10.1109/ICCV.2005.148

P. Yadollahpour, D. Batra, and G. Shakhnarovich, Discriminative Re-ranking of Diverse Segmentations, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.251

Y. Yang and D. Ramanan, Articulated pose estimation with flexible mixtures-of-parts, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995741

M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV, 2014.
DOI : 10.1007/978-3-319-10590-1_53

J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, International Journal of Computer Vision, vol.73, issue.2, pp.213-238, 2007.
DOI : 10.1007/s11263-006-9794-4
URL : https://hal.archives-ouvertes.fr/inria-00548574

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Object detectors emerge in deep scene CNNs, CoRR, abs/1412, p.8, 2014.