H. Wang, D. Oneata, J. Verbeek, and C. Schmid, A Robust and Efficient Video Representation for Action Recognition, International Journal of Computer Vision, vol.119, issue.3, 2016.
DOI : 10.1007/s11263-015-0846-5

URL : https://hal.archives-ouvertes.fr/hal-01145834

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, 2014.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning Spatiotemporal Features with 3D Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.510

URL : http://arxiv.org/abs/1412.0767

J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga et al., Beyond short snippets: Deep networks for video classification, CVPR, 2015.

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, ECCV, 2016.
DOI : 10.1007/978-3-319-46484-8_2

URL : http://arxiv.org/abs/1608.00859

G. Gkioxari and J. Malik, Finding action tubes, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298676

P. Weinzaepfel, Z. Harchaoui, and C. Schmid, Learning to Track for Spatio-Temporal Action Localization, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.362

URL : https://hal.archives-ouvertes.fr/hal-01159941

L. Wang, Y. Qiao, and X. Tang, Video Action Detection with Relational Dynamic-Poselets, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_37

S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin, Deep learning for detecting multiple space-time action tubes in videos, BMVC, 2016.

X. Peng and C. Schmid, Multi-region Two-Stream R-CNN for Action Detection, ECCV, 2016.
DOI : 10.1007/978-3-319-16178-5_32

URL : https://hal.archives-ouvertes.fr/hal-01349107

M. M. Puscas, E. Sangineto, D. Culibrk, and N. Sebe, Unsupervised Tube Extraction Using Transductive Learning and Dense Trajectories, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.193

J. C. van Gemert, M. Jain, E. Gati, and C. G. Snoek, APT: Action localization proposals from dense trajectories, Proceedings of the British Machine Vision Conference 2015, 2015.
DOI : 10.5244/C.29.177

R. G. Cinbis, J. Verbeek, and C. Schmid, Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.39, issue.1, 2016.
DOI : 10.1109/TPAMI.2016.2535231

URL : https://hal.archives-ouvertes.fr/hal-01123482

H. Bilen and A. Vedaldi, Weakly Supervised Deep Detection Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.311

URL : http://arxiv.org/abs/1511.02853

P. Mettes, J. C. van Gemert, and C. G. Snoek, Spot On: Action Localization from Pointly-Supervised Proposals, ECCV, 2016.
DOI : 10.1007/978-3-319-46454-1_27

URL : http://arxiv.org/abs/1604.07602

A. Kläser, M. Marszalek, C. Schmid, and A. Zisserman, Human Focused Action Localization in Video, International Workshop on Sign, Gesture, and Activity (SGA), 2010.
DOI : 10.1007/978-3-642-35749-7_17

A. Prest, V. Ferrari, and C. Schmid, Explicit modeling of human-object interactions in realistic videos, IEEE Trans. PAMI, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00720847

G. Yu and J. Yuan, Fast action proposals for human action detection and search, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298735

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS, 2015.
DOI : 10.1109/TPAMI.2016.2577031

URL : http://arxiv.org/abs/1506.01497

M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, 2D Human Pose Estimation: New Benchmark and State of the Art Analysis, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.471

S. Hare, A. Saffari, and P. Torr, Struck: Structured output tracking with kernels, ICCV, 2011.
DOI : 10.1109/tpami.2015.2509974

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.294.5858

Z. Kalal, K. Mikolajczyk, and J. Matas, Tracking-learning-detection, IEEE Trans. PAMI, 2012.

I. Laptev and P. Pérez, Retrieving actions in movies, 2007 IEEE 11th International Conference on Computer Vision, 2007.
DOI : 10.1109/ICCV.2007.4409105

L. Cao, Z. Liu, and T. S. Huang, Cross-dataset action detection, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
DOI : 10.1109/CVPR.2010.5539875

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.459.6620

J. Yuan, Z. Liu, and Y. Wu, Discriminative subvolume search for efficient action detection, CVPR, 2009.

A. Gaidon, Z. Harchaoui, and C. Schmid, Temporal Localization of Actions with Actoms, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.11, 2013.
DOI : 10.1109/TPAMI.2013.65

URL : https://hal.archives-ouvertes.fr/hal-00804627

T. Lan, Y. Wang, and G. Mori, Discriminative figure-centric models for joint action localization and recognition, ICCV, 2011.

A. Kläser, M. Marszalek, and C. Schmid, A Spatio-Temporal Descriptor Based on 3D-Gradients, Proceedings of the British Machine Vision Conference 2008, 2008.
DOI : 10.5244/C.22.99

M. Jain, J. van Gemert, H. Jégou, P. Bouthemy, and C. Snoek, Action Localization with Tubelets from Motion, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.100

URL : https://hal.archives-ouvertes.fr/hal-00996844

W. Chen, C. Xiong, R. Xu, and J. Corso, Actionness Ranking with Lattice Conditional Ordinal Random Fields, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.101

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.671.8057

D. Oneata, J. Revaud, J. Verbeek, and C. Schmid, Spatio-temporal Object Detection Proposals, ECCV, 2014.
DOI : 10.1007/978-3-319-10578-9_48

URL : https://hal.archives-ouvertes.fr/hal-01021902

P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce et al., Weakly Supervised Action Labeling in Videos under Ordering Constraints, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_41

URL : https://hal.archives-ouvertes.fr/hal-01053967

O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, Automatic annotation of human actions in video, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459279

M. Hoai, L. Torresani, F. De la Torre, and C. Rother, Learning discriminative localization from weakly labeled data, Pattern Recognition, vol.47, issue.3, 2014.
DOI : 10.1016/j.patcog.2013.09.028

A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, Learning object class detectors from weakly annotated video, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
DOI : 10.1109/CVPR.2012.6248065

URL : https://hal.archives-ouvertes.fr/hal-00695940

P. Siva and T. Xiang, Weakly Supervised Action Detection, Proceedings of the British Machine Vision Conference 2011, 2011.
DOI : 10.5244/C.25.65

I. Laptev, On space-time interest points, International Journal of Computer Vision, 2005.
DOI : 10.1007/s11263-005-1838-7

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.1419

E. A. Mosabbeb, R. Cabral, F. De la Torre, and M. Fathy, Multilabel discriminative weakly-supervised human activity recognition and localization, ACCV, 2014.

S. Ma, J. Zhang, N. Ikizler-Cinbis, and S. Sclaroff, Action Recognition and Localization by Hierarchical Space-Time Segments, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.341

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.663.1492

W. Chen and J. J. Corso, Action detection by implicit intentional motion clustering, ICCV, 2015.

M. D. Rodriguez, J. Ahmed, and M. Shah, Action MACH: A spatio-temporal maximum average correlation height filter for action recognition, CVPR, 2008.

H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, Towards Understanding Action Recognition, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.396

URL : https://hal.archives-ouvertes.fr/hal-00906902

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, 2012.

J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, Image Classification with the Fisher Vector: Theory and Practice, International Journal of Computer Vision, vol.105, issue.3, 2013.
DOI : 10.1007/s11263-013-0636-x

T. Wu, C. Lin, and R. C. Weng, Probability estimates for multi-class classification by pairwise coupling, Journal of Machine Learning Research, 2004.

T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, High Accuracy Optical Flow Estimation Based on a Theory for Warping, ECCV, 2004.
DOI : 10.1007/978-3-540-24673-2_3

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.1732

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A large video database for human motion recognition, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126543

URL : http://cbcl.mit.edu/publications/ps/Kuehne_etal_iccv11.pdf

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR, 2015.