R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.81

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus et al., OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, ICLR, 2014.

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012.

C. Farabet, C. Couprie, L. Najman, and Y. LeCun, Learning Hierarchical Features for Scene Labeling, PAMI, 2013.
DOI : 10.1109/TPAMI.2012.231

URL : https://hal.archives-ouvertes.fr/hal-00742077

C. Couprie, C. Farabet, L. Najman, and Y. LeCun, Indoor Semantic Segmentation using depth information, ICLR, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00805105

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol.86, issue.11, pp.2278-2324, 1998.
DOI : 10.1109/5.726791

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.138.1115

S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, R. Gülçehre et al., Combining modality specific deep neural networks for emotion recognition in video, ICMI, 2013.

Y. Taigman, M. Yang, M. A. Ranzato, and L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.220

M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, Spatio-Temporal Convolutional Sparse Auto-Encoder for Sequence Classification, Proceedings of the British Machine Vision Conference 2012, 2012.
DOI : 10.5244/C.26.124

URL : https://hal.archives-ouvertes.fr/hal-01353046

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-Scale Video Classification with Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.223

K. Simonyan and A. Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos, 2014.

A. Jain, J. Tompson, Y. LeCun, and C. Bregler, MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation, ACCV, 2014.
DOI : 10.1007/978-3-319-16808-1_21

S. Escalera, X. Baró, J. Gonzàlez, M. Bautista, M. Madadi et al., ChaLearn Looking at People Challenge 2014: Dataset and Results, ECCVW, 2014.
DOI : 10.1007/978-3-319-16178-5_32

URL : https://hal.archives-ouvertes.fr/hal-01381162

H. Wang, A. Kläser, C. Schmid, and C. Liu, Dense Trajectories and Motion Boundary Descriptors for Action Recognition, International Journal of Computer Vision, vol.103, issue.1, 2013.
DOI : 10.1007/s11263-012-0594-8

URL : https://hal.archives-ouvertes.fr/hal-00725627

H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid, Evaluation of local spatio-temporal features for action recognition, Proceedings of the British Machine Vision Conference 2009, 2009.
DOI : 10.5244/C.23.124

URL : https://hal.archives-ouvertes.fr/inria-00439769

P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, Behavior Recognition via Sparse Spatio-Temporal Features, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
DOI : 10.1109/VSPETS.2005.1570899

I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587756

URL : https://hal.archives-ouvertes.fr/inria-00548659

A. Kläser, M. Marszałek, and C. Schmid, A Spatio-Temporal Descriptor Based on 3D-Gradients, Proceedings of the British Machine Vision Conference 2008, 2008.
DOI : 10.5244/C.22.99

G. Willems, T. Tuytelaars, and L. Van Gool, An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector, ECCV, 2008.
DOI : 10.1007/978-3-540-88688-4_48

J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio et al., Real-time human pose recognition in parts from single depth images, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995316

C. Keskin, F. Kiraç, Y. Kara, and L. Akarun, Real time hand pose estimation using depth sensors, ICCV Workshop, 2011.

D. Tang, T. Yu, and T. Kim, Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.400

J. Tompson, M. Stein, Y. LeCun, and K. Perlin, Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks, ACM Transactions on Graphics, 2014.
DOI : 10.1145/2629500

N. Neverova, C. Wolf, G. Taylor, and F. Nebout, Hand Segmentation with Structured Convolutional Learning, ACCV, 2014.
DOI : 10.1007/978-3-319-16811-1_45

URL : https://hal.archives-ouvertes.fr/hal-01419789

I. Oikonomidis, N. Kyriazis, and A. Argyros, Efficient model-based 3D tracking of hand articulations using Kinect, Proceedings of the British Machine Vision Conference 2011, 2011.
DOI : 10.5244/C.25.101

C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun, Realtime and Robust Hand Tracking from Depth, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.145

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.454.4572

D. Tang, H. J. Chang, A. Tejani, and T. Kim, Latent Regression Forest: Structured Estimation of 3D Articulated Hand Posture, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.490

F. Wang and Y. Li, Beyond Physical Connections: Tree Models in Human Pose Estimation, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.83

X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun et al., Detect What You Can: Detecting and Representing Objects Using Holistic Models and Body Parts, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.254

URL : http://arxiv.org/abs/1406.2031

J. Wang, Z. Liu, Y. Wu, and J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
DOI : 10.1109/CVPR.2012.6247813

J. Sung, C. Ponce, B. Selman, and A. Saxena, Unstructured Human Activity Detection from RGBD Images, ICRA, 2012.

X. Chen and M. Koskela, Online RGB-D gesture recognition with extreme learning machines, Proceedings of the 15th ACM on International conference on multimodal interaction, ICMI '13, 2013.
DOI : 10.1145/2522848.2532591

C. Monnier, S. German, and A. Ost, A Multi-scale Boosted Detector for Efficient and Robust Gesture Recognition, ECCVW, 2014.
DOI : 10.1007/978-3-319-16178-5_34

J. Y. Chang, Nonparametric Gesture Labeling from Multi-modal Data, ECCV Workshop, 2014.
DOI : 10.1007/978-3-319-16178-5_35

K. Nandakumar, W. K. Wah, C. S. Alice, N. W. Terence, W. J. Gang et al., A Multi-modal Gesture Recognition System Using Audio, Video, and Skeletal Joint Data, ICMI Workshop, 2013.

Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, CVPR 2011, 2011.
DOI : 10.1109/CVPR.2011.5995496

URL : http://ai.stanford.edu/~quocle/LeZouYeungNg11_appendix.pdf

M. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun, Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition, 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
DOI : 10.1109/CVPR.2007.383157

B. Chen, J. Ting, B. Marlin, and N. de Freitas, Deep learning of invariant Spatio-Temporal Features from Video, NIPSW, 2010.

P. Gehler and S. Nowozin, On feature combination for multiclass object classification, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459169

G. Ye, D. Liu, I. Jhuo, and S. Chang, Robust Late Fusion With Rank Minimization, CVPR, 2012.

D. Liu, K. Lai, G. Ye, M. Chen, and S. Chang, Sample-Specific Late Fusion for Visual Category Recognition, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.109

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.394.155

Z. Xu, Y. Yang, I. Tsang, N. Sebe, and A. Hauptmann, Feature Weighting via Optimal Thresholding for Video Analysis, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.427

URL : https://opus.lib.uts.edu.au/bitstream/10453/29571/1/2013004175OK.pdf

P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis et al., Multimodal feature fusion for robust event detection in web videos, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
DOI : 10.1109/CVPR.2012.6247814

N. Srivastava and R. Salakhutdinov, Multimodal learning with Deep Boltzmann Machines, NIPS, 2013.

N. Neverova, C. Wolf, G. Paci et al., A multi-scale approach to gesture detection and recognition, ICCV Workshop, 2013.

N. Neverova, C. Wolf, G. Taylor, and F. Nebout, Multi-scale Deep Learning for Gesture Detection and Localization, ECCVW, 2014.
DOI : 10.1007/978-3-319-16178-5_33

URL : https://hal.archives-ouvertes.fr/hal-01419792

M. Zanfir, M. Leordeanu, and C. Sminchisescu, The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.342

L. Deng, J. Li, J. Huang, K. Yao, D. Yu et al., Recent advances in deep learning for speech recognition at Microsoft, ICASSP, 2013.

L. A. Alexandre, A. C. Campilho, and M. Kamel, On combining classifiers using sum and product rules, Pattern Recognition Letters, pp.1283-1289, 2001.
DOI : 10.1016/S0167-8655(01)00073-3

P. Baldi and P. Sadowski, The dropout learning algorithm, Artificial Intelligence, vol.210, pp.78-122, 2014.
DOI : 10.1016/j.artint.2014.02.004

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, 2012.

S. Wang and C. Manning, Fast dropout training, ICML, 2013.

E. L. Lehmann, Elements of Large-Sample Theory, Springer, 1998.
DOI : 10.1007/b98855

P. Geurts, D. Ernst, and L. Wehenkel, Extremely randomized trees, Machine learning, pp.3-42, 2006.
DOI : 10.1007/s10994-006-6226-1

URL : https://hal.archives-ouvertes.fr/hal-00341932

A. Lee, T. Kawahara, and K. Shikano, Julius - an open source real-time large vocabulary recognition engine, Interspeech, 2001.

N. Camgoz, A. Kindiroglu, and L. Akarun, Gesture Recognition Using Template Based Random Forest Classifiers, ECCVW, 2014.
DOI : 10.1007/978-3-319-16178-5_41

G. Evangelidis, G. Singh, and R. Horaud, Continuous Gesture Recognition from Articulated Poses, ECCV Workshop, 2014.
DOI : 10.1007/978-3-319-16178-5_42

URL : https://hal.archives-ouvertes.fr/hal-01082981

X. Peng, L. Wang, and Z. Cai, Action and Gesture Temporal Spotting with Super Vector Representation, ECCVW, 2014.
DOI : 10.1007/978-3-319-16178-5_36

G. Chen, D. Clarke, M. Giuliani, D. Weikersdorfer, and A. Knoll, Multi-modality Gesture Detection and Recognition With Unsupervision, Randomization and Discrimination, ECCVW, 2014.
DOI : 10.1007/978-3-319-16178-5_43

L. Pigou, S. Dieleman, and P. Kindermans, Sign Language Recognition Using Convolutional Neural Networks, ECCVW, 2014.
DOI : 10.1007/978-3-319-16178-5_40

URL : http://hdl.handle.net/1854/LU-5796137

D. Wu, Deep Dynamic Neural Networks for Gesture Segmentation and Recognition, ECCV Workshop, 2014.
DOI : 10.1007/978-3-319-16178-5_39
