F. Bach and Z. Harchaoui, DIFFRAC: A discriminative and flexible framework for clustering, NIPS, p.7, 2007.

P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid et al., Finding Actors and Actions in Movies, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.283
URL : https://hal.archives-ouvertes.fr/hal-00904991

P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce et al., Weakly Supervised Action Labeling in Videos under Ordering Constraints, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_41
URL : https://hal.archives-ouvertes.fr/hal-01053967

P. Bojanowski, R. Lajugie, E. Grave, F. Bach, I. Laptev et al., Weakly-Supervised Alignment of Video with Text, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.507
URL : https://hal.archives-ouvertes.fr/hal-01154523

N. Chambers and D. Jurafsky, Unsupervised learning of narrative event chains, ACL, 2008.

V. Chari, S. Lacoste-Julien, I. Laptev, and J. Sivic, On pairwise costs for network flow multi-object tracking, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7299193
URL : http://arxiv.org/abs/1408.3304

M. Cimpoi, S. Maji, and A. Vedaldi, Deep filter banks for texture recognition and segmentation, CVPR, 2015.
DOI : 10.1109/cvpr.2015.7299007
URL : https://hal.archives-ouvertes.fr/hal-01263622

M.-C. de Marneffe, B. MacCartney, and C. D. Manning, Generating typed dependency parses from phrase structure parses, LREC, p.9, 2006.

O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, Automatic annotation of human actions in video, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459279

C. Fellbaum, WordNet: An electronic lexical database, 1998.

L. Frermann, I. Titov, and M. Pinkal, A Hierarchical Bayesian Model for Unsupervised Induction of Script Knowledge, Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014.
DOI : 10.3115/v1/E14-1006

D. G. Higgins and P. M. Sharp, CLUSTAL: a package for performing multiple sequence alignment on a microcomputer, Gene, vol.73, issue.1, p.9, 1988.
DOI : 10.1016/0378-1119(88)90330-7

D. Hsu, S. M. Kakade, and T. Zhang, Random Design Analysis of Ridge Regression, Foundations of Computational Mathematics, vol.14, issue.3, 2014.
DOI : 10.1162/0899766054323008

M. Jaggi, Revisiting Frank-Wolfe: Projection-free sparse convex optimization, ICML, 2013.

A. Joulin, K. Tang, and L. Fei-Fei, Efficient image and video colocalization with Frank-Wolfe algorithm, ECCV, 2014.
DOI : 10.1007/978-3-319-10599-4_17
URL : http://ai.stanford.edu/%7Ekdtang/papers/eccv14-vidcoloc.pdf

S. Lacoste-Julien, Convergence rate of Frank-Wolfe for non-convex objectives, arXiv preprint, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01415335

S. Lacoste-Julien and M. Jaggi, On the global linear convergence of Frank-Wolfe optimization variants, NIPS, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01248675

S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher, Block-coordinate Frank-Wolfe optimization for structural SVMs, Proceedings of the International Conference on Machine Learning (ICML), 2013.
URL : https://hal.archives-ouvertes.fr/hal-00720158

I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587756
URL : https://hal.archives-ouvertes.fr/inria-00548659

C. Lee, C. Grasso, and M. Sharlow, Multiple sequence alignment using partial order graphs, Bioinformatics, vol.18, issue.3, 2002.
DOI : 10.1093/bioinformatics/18.3.452
URL : https://academic.oup.com/bioinformatics/article-pdf/18/3/452/648375/180452.pdf

T. Liao, Clustering of time series data, a survey, Pattern Recognition, vol.38, issue.11, 2005.

J. Malmaud, J. Huang, V. Rathod, N. Johnston, A. Rabinovich et al., What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision, Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015.
DOI : 10.3115/v1/N15-1015
URL : http://arxiv.org/abs/1503.01558

A. Miech, J. Alayrac, P. Bojanowski, I. Laptev, and J. Sivic, Learning from video and text via large-scale discriminative clustering, ICCV, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01569540

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, NIPS, 2013.

G. A. Miller, WordNet: A lexical database for English, Communications of the ACM, vol.38, issue.11, 1995.
DOI : 10.1145/219717.219748
URL : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1823&rep=rep1&type=pdf

I. Naim, Y. Song, Q. Liu, L. Huang, H. Kautz et al., Discriminative Unsupervised Alignment of Natural Language Instructions with Corresponding Video Segments, Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015.
DOI : 10.3115/v1/N15-1017

J. C. Niebles, C. Chen, and L. Fei-Fei, Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification, ECCV, 2010.
DOI : 10.1007/978-3-642-15552-9_29

J. C. Niebles, H. Wang, and L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, IJCV, 2008.

A. Osokin, J. Alayrac, I. Lukasewitz, P. K. Dokania, and S. Lacoste-Julien, Minding the gaps for block Frank-Wolfe optimization of structured SVMs, Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
URL : https://hal.archives-ouvertes.fr/hal-01323727

D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, Category-Specific Video Summarization, ECCV, 2014.
DOI : 10.1007/978-3-319-10599-4_35
URL : https://hal.archives-ouvertes.fr/hal-01022967

M. Raptis and L. Sigal, Poselet Key-Framing: A Model for Human Activity Recognition, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.342

M. Regneri, A. Koller, and M. Pinkal, Learning script knowledge with Web experiments, ACL, 2010.

O. Sener, A. Zamir, S. Savarese, and A. Saxena, Unsupervised Semantic Parsing of Video Collections, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.509
URL : http://arxiv.org/abs/1506.08438

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR, 2015.

M. Sun, A. Farhadi, and S. Seitz, Ranking Domain-Specific Highlights by Analyzing Edited Videos, ECCV, 2014.
DOI : 10.1007/978-3-319-10590-1_51

H. Wang and C. Schmid, Action Recognition with Improved Trajectories, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.441
URL : https://hal.archives-ouvertes.fr/hal-00873267

L. Wang and T. Jiang, On the Complexity of Multiple Sequence Alignment, Journal of Computational Biology, vol.1, issue.4, pp.337-348, 1994.
DOI : 10.1089/cmb.1994.1.337

Jean-Baptiste Alayrac received the MS degree in computer science from Ecole Normale Supérieure (ENS) in Paris in 2014. He is currently working toward the PhD degree in the research teams WILLOW and SIERRA at INRIA Paris under the supervision of Josef Sivic, Ivan Laptev and Simon Lacoste-Julien. His research focuses on structured prediction from vision and natural language.

Piotr Bojanowski is a postdoctoral researcher at Facebook AI Research. He received his Ph.D. from the WILLOW team at INRIA Paris in 2016, where he was supervised by [...]. His work focuses on automatic video and image understanding.