J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ImageNet: A large-scale hierarchical image database, In: Computer Vision and Pattern Recognition (CVPR) IEEE, pp.248-255, 2009.

B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, Learning deep features for scene recognition using places database, Neural Information Processing Systems (NIPS), pp.487-495, 2014.

F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, ActivityNet: A large-scale video benchmark for human activity understanding, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.961-970, 2015.
DOI : 10.1109/CVPR.2015.7298698

URL : http://repository.kaust.edu.sa/kaust/bitstream/10754/556141/1/ActivityNet_CVPR2015.pdf

J. Liu, J. Luo, and M. Shah, Recognizing realistic actions from videos in the wild, In: Computer Vision and Pattern Recognition (CVPR) IEEE, 2009.

A. Gorban, H. Idrees, Y.-G. Jiang, A. Roshan Zamir, I. Laptev et al., THUMOS challenge: Action recognition with a large number of classes, 2015.

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-scale video classification with convolutional neural networks, In: Computer Vision and Pattern Recognition (CVPR) IEEE, pp.1725-1732, 2014.
DOI : 10.1109/cvpr.2014.223

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.471.3312

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A large video database for human motion recognition, 2011 International Conference on Computer Vision, pp.2556-2563, 2011.
DOI : 10.1109/ICCV.2011.6126543

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, 2012.

I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2008.
DOI : 10.1109/CVPR.2008.4587756

URL : https://hal.archives-ouvertes.fr/inria-00548659

M. D. Rodriguez, J. Ahmed, and M. Shah, Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition, 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2008.
DOI : 10.1109/CVPR.2008.4587727

A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, A dataset for Movie Description, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298940

C. Schüldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), pp.32-36, 2004.
DOI : 10.1109/ICPR.2004.1334462

L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, Actions as Space-Time Shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.29, issue.12, pp.2247-2253, 2007.
DOI : 10.1109/TPAMI.2007.70711

M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele, A database for fine grained activity detection of cooking activities, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.1194-1201, 2012.
DOI : 10.1109/CVPR.2012.6247801

S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. C. Chen et al., A large-scale benchmark dataset for event recognition in surveillance video, CVPR 2011, pp.3153-3160, 2011.
DOI : 10.1109/CVPR.2011.5995586

H. Kuehne, A. B. Arslan, and T. Serre, The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.105

A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal et al., Coherent Multi-sentence Video Description with Variable Level of Detail, In: Pattern Recognition, pp.184-195, 2014.
DOI : 10.1007/978-3-319-11752-2_15

M. Marszałek, I. Laptev, and C. Schmid, Actions in context, In: Computer Vision and Pattern Recognition (CVPR) IEEE, 2009.

V. Ferrari, M. Marín-Jiménez, and A. Zisserman, 2D Human Pose Estimation in TV Shows, pp.128-147, 2009.
DOI : 10.1007/978-3-642-03061-1_7

D. L. Chen and W. B. Dolan, Collecting highly parallel data for paraphrase evaluation, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.190-200, 2011.

A. Torabi, C. Pal, H. Larochelle, and A. Courville, Using descriptive video services to create a large data source for video annotation research, 2015.

A. Gupta and L. S. Davis, Objects in Action: An Approach for Combining Action Understanding and Object Perception, 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp.1-8, 2007.
DOI : 10.1109/CVPR.2007.383331

M. S. Ryoo and J. K. Aggarwal, Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities, 2009 IEEE 12th International Conference on Computer Vision, pp.1593-1600, 2009.
DOI : 10.1109/ICCV.2009.5459361

K. Tuite, N. Snavely, D. Y. Hsiao, N. Tabing, and Z. Popović, PhotoCity: Training experts at large-scale image acquisition through a competitive game, Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, CHI '11, pp.1383-1392, 2011.
DOI : 10.1145/1978942.1979146

H. Pirsiavash and D. Ramanan, Detecting activities of daily living in first-person camera views, 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp.2847-2854, 2012.
DOI : 10.1109/CVPR.2012.6248010

Y. Iwashita, A. Takamine, R. Kurazume, and M. S. Ryoo, First-Person Animal Activity Recognition from Egocentric Videos, 2014 22nd International Conference on Pattern Recognition, 2014.
DOI : 10.1109/ICPR.2014.739

C. L. Zitnick and D. Parikh, Bringing Semantics into Focus Using Visual Abstraction, 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp.3009-3016, 2013.
DOI : 10.1109/CVPR.2013.387

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.306.7749

G. Salton and M. J. McGill, Introduction to modern information retrieval, pp.24-51, 1983.

G. A. Sigurdsson, O. Russakovsky, A. Farhadi, I. Laptev, and A. Gupta, Much ado about time: Exhaustive annotation of temporal data, arXiv preprint arXiv:1607.07429, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01431527

G. K. Zipf, The psycho-biology of language, 1935.

L. van der Maaten and G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research, vol.9, pp.2579-2605, 2008.

H. Wang and C. Schmid, Action Recognition with Improved Trajectories, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.441

URL : https://hal.archives-ouvertes.fr/hal-00873267

F. Perronnin, J. Sánchez, and T. Mensink, Improving the Fisher Kernel for Large-Scale Image Classification, European Conference on Computer Vision (ECCV), 2010.
DOI : 10.1007/978-3-642-15561-1_11

URL : https://hal.archives-ouvertes.fr/inria-00548630

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations (ICLR), 2015.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, Neural Information Processing Systems (NIPS), 2014.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning Spatiotemporal Features with 3D Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.510

URL : http://arxiv.org/abs/1412.0767

X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta et al., Microsoft COCO captions: Data collection and evaluation server, 2015.

J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick, Exploring nearest neighbor approaches for image captioning, arXiv preprint arXiv:1505.04467, 2015.

S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell et al., Sequence to Sequence - Video to Text, 2015 IEEE International Conference on Computer Vision (ICCV), pp.4534-4542, 2015.
DOI : 10.1109/ICCV.2015.515