A. Ghias, J. Logan, D. Chamberlin, and B. Smith, Query by humming, Proceedings of the third ACM international conference on Multimedia, MULTIMEDIA '95, 1995.
DOI : 10.1145/217279.215273

R. Dannenberg, An intelligent multi-track audio editor, Proc. ICMC, 2007.

V. Emiya, R. Badeau, and B. David, Multipitch Estimation of Piano Sounds Using a New Probabilistic Spectral Smoothness Principle, IEEE Transactions on Audio, Speech, and Language Processing, vol.18, issue.6, pp.1643-1654, 2010.
DOI : 10.1109/TASL.2009.2038819
URL : https://hal.archives-ouvertes.fr/inria-00510392

A. Arzt, G. Widmer, and S. Dixon, Automatic page turning for musicians via real-time machine listening, Proc. ECAI, 2008.

C. Raphael, A Bayesian network for real-time musical accompaniment, Adv. NIPS, 2001.

N. Orio, S. Lemouton, and D. Schwarz, Score following: State of the art and new developments, Proc. NIME, 2003.

A. Cont, D. Schwarz, N. Schnell, and C. Raphael, Evaluation of real-time audio-to-score alignment, Proc. ISMIR, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00839068

M. Müller, F. Kurth, and M. Clausen, Audio matching via chroma-based statistical features, Proc. ISMIR, 2005.

S. Dixon and G. Widmer, MATCH: A music alignment tool chest, Proc. ISMIR, 2005.

Ö. İzmirli and R. Dannenberg, Understanding features and distance functions for music sequence alignment, Proc. ISMIR, 2010.

S. Ewert, M. Müller, and P. Grosche, High resolution audio synchronization using chroma onset features, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.1869-1872, 2009.
DOI : 10.1109/ICASSP.2009.4959972

C. Joder, S. Essid, and G. Richard, Learning Optimal Features for Polyphonic Audio-to-Score Alignment, IEEE Transactions on Audio, Speech, and Language Processing, vol.21, issue.10, pp.2118-2128, 2013.
DOI : 10.1109/TASL.2013.2266794

J. Keshet, S. Shalev-Shwartz, Y. Singer, and D. Chazan, A Large Margin Algorithm for Speech-to-Phoneme and Music-to-Score Alignment, IEEE Transactions on Audio, Speech, and Language Processing, vol.15, issue.8, pp.2373-2382, 2007.
DOI : 10.1109/TASL.2007.903928

Y. Guo and D. Schuurmans, Convex relaxations of latent variable training, Adv. NIPS, 2007.

F. Bach and Z. Harchaoui, DIFFRAC: a discriminative and flexible framework for clustering, Adv. NIPS, 2008.

P. Bojanowski, R. Lajugie, F. Bach et al., Weakly supervised action labeling in videos under ordering constraints, Proc. ECCV, 2014.

A. Joulin, F. Bach, and J. Ponce, Discriminative clustering for image co-segmentation, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
DOI : 10.1109/CVPR.2010.5539868

A. Joulin, K. Tang, and L. Fei-Fei, Efficient Image and Video Co-localization with Frank-Wolfe Algorithm, Proc. ECCV, 2014.
DOI : 10.1007/978-3-319-10599-4_17

E. Grave, A convex relaxation for weakly supervised relation extraction, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
DOI : 10.3115/v1/D14-1166
URL : https://hal.archives-ouvertes.fr/hal-01080310

P. Bojanowski, R. Lajugie, E. Grave, F. Bach, I. Laptev et al., Weakly-Supervised Alignment of Video with Text, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.507
URL : https://hal.archives-ouvertes.fr/hal-01154523

H. Sakoe and S. Chiba, Dynamic programming algorithm optimization for spoken word recognition, IEEE Trans. on ASSP, vol.26, issue.1, pp.43-49, 1978.

R. Lajugie, D. Garreau, S. Arlot, and F. Bach, Metric learning for temporal sequence alignment, Adv. NIPS, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01062130

M. Kennedy, The Oxford dictionary of music, 1994.

M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics Quarterly, vol.3, issue.1-2, pp.95-110, 1956.
DOI : 10.1002/nav.3800030109

M. Jaggi, Revisiting Frank-Wolfe: Projection-free sparse convex optimization, Proc. ICML, 2013.

T. Eerola and P. Toiviainen, Finnish folk song database, 2004.

O. Lartillot and P. Toiviainen, A Matlab toolbox for musical feature extraction from audio, Proc. DAFx, 2007.