G. Chéron, I. Laptev, and C. Schmid, P-CNN: Pose-based CNN features for action recognition, Proceedings of the IEEE International Conference on Computer Vision, pp.3218-3226, 2015.

J. Redmon and A. Farhadi, YOLO9000: Better, Faster, Stronger, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.6517-6525, 2017.
DOI : 10.1109/CVPR.2017.690

URL : http://arxiv.org/pdf/1612.08242

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, Learning Deep Features for Discriminative Localization, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.319

URL : http://arxiv.org/pdf/1512.04150

X. Wang, D. Kumar, N. Thome, M. Cord, and F. Precioso, Recipe recognition with large multimodal food dataset, Multimedia & Expo Workshops (ICMEW), 2015 IEEE International Conference on, pp.1-6, 2015.
DOI : 10.1109/ICMEW.2015.7169816

URL : https://hal.archives-ouvertes.fr/hal-01196959

L. Herranz, W. Min, and S. Jiang, Food recognition and recipe analysis: integrating visual content, context and external knowledge, 2018.

J. Chen and C. Ngo, Deep-based Ingredient Recognition for Cooking Recipe Retrieval, Proceedings of the 2016 ACM on Multimedia Conference, MM '16, pp.32-41, 2016.

J. Jermsurawong and N. Habash, Predicting the Structure of Cooking Recipes, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.781-786, 2015.
DOI : 10.18653/v1/D15-1090

URL : https://doi.org/10.18653/v1/d15-1090

Y. Yamakata, S. Imahori, H. Maeta, and S. Mori, A method for extracting major workflow composed of ingredients, tools, and actions from cooking procedural text, 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp.1-6, 2016.
DOI : 10.1109/ICMEW.2016.7574705

D.-A. Huang, J. J. Lim, L. Fei-Fei, and J. C. Niebles, Unsupervised visual-linguistic reference resolution in instructional videos, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka et al., Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data, International Journal of Computer Vision, vol.34, issue.9, pp.1-28
DOI : 10.1109/ICCVW.2011.6130353

URL : http://arxiv.org/pdf/1502.06648

S. Stein and S. J. McKenna, Combining embedded accelerometers with computer vision for recognizing food preparation activities, Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing, UbiComp '13, 2013.
DOI : 10.1145/2493432.2493482

URL : http://cvip.computing.dundee.ac.uk/papers/Stein2013UbiComp.pdf

H. Kuehne, A. B. Arslan, and T. Serre, The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.105

A. Hashimoto, T. Sasada, Y. Yamakata, S. Mori, and M. Minoh, KUSK dataset, Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing Adjunct Publication, UbiComp '14 Adjunct, pp.583-588, 2014.
DOI : 10.1145/2638728.2641338

L. Zhou, C. Xu, and J. J. Corso, Towards automatic learning of procedures from web instructional videos. arXiv preprint, 2017.

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari et al., Scaling egocentric vision: The epic-kitchens dataset. arXiv preprint, 2018.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in neural information processing systems, pp.1097-1105, 2012.
DOI : 10.1145/3065386

URL : http://dl.acm.org/ft_gateway.cfm?id=3065386&type=pdf

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Is object localization for free? - Weakly-supervised learning with convolutional neural networks, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.685-694, 2015.
DOI : 10.1109/CVPR.2015.7298668

URL : https://hal.archives-ouvertes.fr/hal-01015140

R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh et al., Grad-CAM: Visual explanations from deep networks via gradient-based localization, International Conference on Computer Vision, 2017.