J. Carreira and A. Zisserman, Quo vadis, action recognition? A new model and the Kinetics dataset, CVPR, 2017.

V. Choutas, P. Weinzaepfel, J. Revaud, and C. Schmid, PoTion: Pose MoTion representation for action recognition, CVPR, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01764222

D. Erhan, Y. Bengio, A. Courville, and P. Vincent, Visualizing higher-layer features of a deep network, Technical report, Université de Montréal, 2009.

C. Feichtenhofer, H. Fan, J. Malik, and K. He, SlowFast networks for video recognition, 2018.

R. Girdhar and D. Ramanan, Attentional pooling for action recognition, NIPS, 2017.

R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, Video action transformer network, 2018.

R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal et al., The "Something Something" video database for learning and evaluating visual common sense, ICCV, 2017.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR, 2016.

F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, ActivityNet: A large-scale video benchmark for human activity understanding, CVPR, 2015.

M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, Spatial transformer networks, NIPS, 2015.

A. Katharopoulos and F. Fleuret, Processing megapixel images with deep attention-sampling models, ICML, 2019.

Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. M. Snoek, VideoLSTM convolves, attends and flows for action recognition, Computer Vision and Image Understanding, 2018.

J. Lin, C. Gan, and S. Han, Temporal shift module for efficient video understanding, 2018.

X. Long, C. Gan, G. de Melo, J. Wu, X. Liu et al., Attention clusters: Purely attention based local feature integration for video classification, CVPR, 2018.

M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal et al., Moments in Time dataset: one million videos for event understanding, 2019.

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam et al., Grad-CAM: Visual explanations from deep networks via gradient-based localization, ICCV, 2017.

S. Sharma, R. Kiros, and R. Salakhutdinov, Action recognition using visual attention, ICLR (workshop track), 2016.

G. A. Sigurdsson and A. Gupta, PyVideoResearch, GitHub repository, 2018.

G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev et al., Hollywood in homes: Crowdsourcing data collection for activity understanding, ECCV, 2016.

G. A. Sigurdsson, S. Divvala, A. Farhadi, and A. Gupta, Asynchronous temporal fields for action recognition, CVPR, 2017.

G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari, Charades-Ego: A large-scale dataset of paired third and first person videos, 2018.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, 2014.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR, 2015.

K. Simonyan, A. Vedaldi, and A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, 2013.

J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, Striving for simplicity: The all convolutional net, ICLR (workshop track), 2015.

S. Sudhakaran and O. Lanz, Attention is all we need: Nailing down object-centric attention for egocentric activity recognition, BMVC, 2018.

C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar et al., Actor-centric relation network, ECCV, 2018.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning spatiotemporal features with 3D convolutional networks, ICCV, 2015.

G. Varol, I. Laptev, and C. Schmid, Long-term temporal convolutions for action recognition, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01241518

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., Attention is all you need, NIPS, 2017.

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal segment networks: Towards good practices for deep action recognition, ECCV, 2016.

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal segment networks for action recognition in videos, 2018.

X. Wang and A. Gupta, Videos as space-time region graphs, ECCV, 2018.

X. Wang, R. Girshick, A. Gupta, and K. He, Non-local neural networks, CVPR, 2018.

Y. Wang, L. Jiang, M. Yang, L. Li, M. Long et al., Eidetic 3D LSTM: A model for video prediction and beyond, ICLR, 2019.

C.-Y. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krähenbühl et al., Long-term feature banks for detailed video understanding, CVPR, 2019.

S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei, End-to-end learning of action detection from frame glimpses in videos, CVPR, 2016.

J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga et al., Beyond short snippets: Deep networks for video classification, CVPR, 2015.

M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, ECCV, 2014.

H. Zhang, D. Liu, and Z. Xiong, Two-stream oriented video super-resolution for action recognition, 2019.

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Learning deep features for discriminative localization, CVPR, 2016.

B. Zhou, A. Andonian, A. Oliva, and A. Torralba, Temporal relational reasoning in videos, ECCV, 2018.

M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox, Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection, ICCV, 2017.