F. Bach and Z. Harchaoui, DIFFRAC: A discriminative and flexible framework for clustering, NIPS, p.7, 2007.

P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid et al., Finding Actors and Actions in Movies, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.283

URL : https://hal.archives-ouvertes.fr/hal-00904991

P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce et al., Weakly Supervised Action Labeling in Videos under Ordering Constraints, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_41

URL : https://hal.archives-ouvertes.fr/hal-01053967

P. Bojanowski, R. Lajugie, E. Grave, F. Bach, I. Laptev et al., Weakly-Supervised Alignment of Video with Text, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.507

URL : https://hal.archives-ouvertes.fr/hal-01154523

N. Chambers and D. Jurafsky, Unsupervised learning of narrative event chains, ACL, 2008.

V. Chari, S. Lacoste-Julien, I. Laptev, and J. Sivic, On pairwise costs for network flow multi-object tracking, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7299193

URL : http://arxiv.org/abs/1408.3304

M. Cimpoi, S. Maji, and A. Vedaldi, Deep filter banks for texture recognition and segmentation, CVPR, 2015.
DOI : 10.1109/cvpr.2015.7299007

URL : https://hal.archives-ouvertes.fr/hal-01263622

M.-C. de Marneffe, B. MacCartney, and C. D. Manning, Generating typed dependency parses from phrase structure parses, LREC, p.9, 2006.

O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, Automatic annotation of human actions in video, 2009 IEEE 12th International Conference on Computer Vision, 2009.
DOI : 10.1109/ICCV.2009.5459279

C. Fellbaum, WordNet: An electronic lexical database, 1998.

L. Frermann, I. Titov, and M. Pinkal, A Hierarchical Bayesian Model for Unsupervised Induction of Script Knowledge, Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014.
DOI : 10.3115/v1/E14-1006

D. G. Higgins and P. M. Sharp, CLUSTAL: a package for performing multiple sequence alignment on a microcomputer, Gene, vol.73, issue.1, p.9, 1988.
DOI : 10.1016/0378-1119(88)90330-7

D. Hsu, S. M. Kakade, and T. Zhang, Random Design Analysis of Ridge Regression, Foundations of Computational Mathematics, vol.17, issue.36, 2014.
DOI : 10.1162/0899766054323008

M. Jaggi, Revisiting Frank-Wolfe: Projection-free sparse convex optimization, ICML, 2013.

A. Joulin, K. Tang, and L. Fei-Fei, Efficient image and video colocalization with Frank-Wolfe algorithm, ECCV, 2014.
DOI : 10.1007/978-3-319-10599-4_17

URL : http://ai.stanford.edu/%7Ekdtang/papers/eccv14-vidcoloc.pdf

S. Lacoste-Julien, Convergence rate of Frank-Wolfe for non-convex objectives, arXiv preprint, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01415335

S. Lacoste-Julien and M. Jaggi, On the global linear convergence of Frank-Wolfe optimization variants, NIPS, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01248675

S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher, Block-coordinate Frank-Wolfe optimization for structural SVMs, Proceedings of the International Conference on Machine Learning (ICML), 2013.
URL : https://hal.archives-ouvertes.fr/hal-00720158

I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587756

URL : https://hal.archives-ouvertes.fr/inria-00548659

C. Lee, C. Grasso, and M. Sharlow, Multiple sequence alignment using partial order graphs, Bioinformatics, vol.18, issue.3, 2002.
DOI : 10.1093/bioinformatics/18.3.452

URL : https://academic.oup.com/bioinformatics/article-pdf/18/3/452/648375/180452.pdf

T. Liao, Clustering of time series data, a survey, Pattern Recognition, issue.10, 2005.

J. Malmaud, J. Huang, V. Rathod, N. Johnston, A. Rabinovich et al., What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision, Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015.
DOI : 10.3115/v1/N15-1015

URL : http://arxiv.org/abs/1503.01558

A. Miech, J. Alayrac, P. Bojanowski, I. Laptev, and J. Sivic, Learning from video and text via large-scale discriminative clustering, ICCV, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01569540

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, NIPS, 2013.

G. A. Miller, WordNet: A lexical database for English, Communications of the ACM, issue.5, 1995.
DOI : 10.1145/219717.219748

URL : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1823&rep=rep1&type=pdf

I. Naim, Y. Song, Q. Liu, L. Huang, H. Kautz et al., Discriminative Unsupervised Alignment of Natural Language Instructions with Corresponding Video Segments, Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015.
DOI : 10.3115/v1/N15-1017

J. C. Niebles, C. Chen, and L. Fei-Fei, Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification, ECCV, 2010.
DOI : 10.1007/978-3-642-15552-9_29

J. C. Niebles, H. Wang, and L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, IJCV, 2008.

A. Osokin, J. Alayrac, I. Lukasewitz, P. K. Dokania, and S. Lacoste-Julien, Minding the gaps for block Frank-Wolfe optimization of structured SVMs, Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
URL : https://hal.archives-ouvertes.fr/hal-01323727

D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, Category-Specific Video Summarization, ECCV, 2014.
DOI : 10.1007/978-3-319-10599-4_35

URL : https://hal.archives-ouvertes.fr/hal-01022967

M. Raptis and L. Sigal, Poselet Key-Framing: A Model for Human Activity Recognition, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.342

M. Regneri, A. Koller, and M. Pinkal, Learning script knowledge with Web experiments, ACL, 2010.

O. Sener, A. Zamir, S. Savarese, and A. Saxena, Unsupervised Semantic Parsing of Video Collections, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.509

URL : http://arxiv.org/abs/1506.08438

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR, 2015.

M. Sun, A. Farhadi, and S. Seitz, Ranking Domain-Specific Highlights by Analyzing Edited Videos, ECCV, 2014.
DOI : 10.1007/978-3-319-10590-1_51

H. Wang and C. Schmid, Action Recognition with Improved Trajectories, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.441

URL : https://hal.archives-ouvertes.fr/hal-00873267

L. Wang and T. Jiang, On the Complexity of Multiple Sequence Alignment, Journal of Computational Biology, vol.1, issue.4, pp.337-348, 1994.
DOI : 10.1089/cmb.1994.1.337

Jean-Baptiste Alayrac received the MS degree in computer science from École Normale Supérieure (ENS) in Paris in 2014. He is currently working toward the PhD degree in the research teams WILLOW and SIERRA at INRIA Paris under the supervision of Josef Sivic, Ivan Laptev and Simon Lacoste-Julien. His research focuses on structured prediction from vision and natural language.

Piotr Bojanowski is a postdoctoral researcher at Facebook AI Research. He received his Ph.D. in 2016 from the Willow team at INRIA Paris, where he was supervised by [...]. His work focuses on automatic video and image understanding.