J. Andreoli, Convolution, attention and structure embedding, 2019.

D. Bahdanau, K. Cho, and J. Bengio, Neural machine translation by jointly learning to align and translate, International Conference on Learning Representations (ICLR), 2015.

A. Barla, F. Odone, and A. Verri, Histogram intersection kernel for image classification, Proceedings of the 2003 International Conference on Image Processing, p.513, 2003.

R. Bhatia, T. Jain, and Y. Lim, On the Bures-Wasserstein distance between positive definite matrices, Expositiones Mathematicae, 2018.

A. Buades, B. Coll, and J. Morel, Non-Local Means Denoising, Image Processing On Line, vol.1, pp.208-212, 2011.

D. Chen, L. Jacob, and J. Mairal, Biological sequence modeling with convolutional kernel networks, Bioinformatics, vol.35, issue.18, pp.3294-3302, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01632912

D. Chen, L. Jacob, and J. Mairal, Recurrent kernel networks, Advances in Neural Information Processing Systems (NeurIPS), 2019.
URL : https://hal.archives-ouvertes.fr/hal-02151135

L. Chen, G. Wang, C. Tao, D. Shen, P. Cheng et al., Improving textual network embedding with global attention via optimal transport, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2019.

J. Cordonnier, A. Loukas, and M. Jaggi, On the relationship between self-attention and convolutional layers, International Conference on Learning Representations (ICLR), 2020.

M. Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport, Advances in Neural Information Processing Systems (NeurIPS), 2013.

M. Cuturi and A. Doucet, Fast computation of Wasserstein barycenters, International Conference on Machine Learning (ICML), 2013.

Z. Dai, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, Transformer-XL: Attentive language models beyond a fixed-length context, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2019.

J. Devlin, M. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.

K. Grauman and T. Darrell, The pyramid match kernel: Efficient learning with sets of features, Journal of Machine Learning Research, vol.8, pp.725-760, 2007.

J. Hou, B. Adhikari, and J. Cheng, DeepSF: deep convolutional neural network for mapping protein sequences to folds, Bioinformatics, vol.34, issue.8, pp.1295-1303, 2018.

H. Jégou, M. Douze, and C. Schmid, On the burstiness of visual elements, Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

C. K. I. Williams and M. Seeger, Using the Nyström method to speed up kernel machines, Advances in Neural Information Processing Systems (NeurIPS), 2001.

N. Kitaev, L. Kaiser, and A. Levskaya, Reformer: The efficient transformer, International Conference on Learning Representations (ICLR), 2020.

P. P. Kuksa, P. Huang, and V. Pavlovic, Scalable algorithms for string kernels with inexact matching, Advances in Neural Information Processing Systems (NeurIPS), 2009.

M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger, From word embeddings to document distances, International Conference on Machine Learning (ICML), 2015.

C. Leslie, E. Eskin, and W. S. Noble, The spectrum kernel: a string kernel for SVM protein classification, Proceedings of the Pacific Symposium on Biocomputing, pp.564-575, 2002.

C. S. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble, Mismatch string kernels for discriminative protein classification, Bioinformatics, vol.20, issue.4, pp.467-476, 2004.

S. Lyu, Mercer kernels for object recognition with local features, Conference on Computer Vision and Pattern Recognition (CVPR), 2004.

J. Mairal, End-to-end kernel learning with supervised convolutional kernel networks, Advances in Neural Information Processing Systems (NeurIPS), 2016.

J. Mairal, Cyanure: An open-source toolbox for empirical risk minimization for Python, C++, and soon more, 2019.

J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, Convolutional kernel networks, Advances in Neural Information Processing Systems (NeurIPS), 2014.
URL : https://hal.archives-ouvertes.fr/hal-01005489

P. Michel, O. Levy, and G. Neubig, Are sixteen heads really better than one?, Advances in Neural Information Processing Systems (NeurIPS), 2019.

Q. Mérigot, A. Delalande, and F. Chazal, Quantitative stability of optimal transport maps and linearization of the 2-Wasserstein space, International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

G. Peyré and M. Cuturi, Computational optimal transport, Foundations and Trends in Machine Learning, vol.11, pp.355-607, 2019.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang et al., Exploring the limits of transfer learning with a unified text-to-text transformer, 2019.

A. Raganato, Y. Scherrer, and J. Tiedemann, Fixed encoder self-attention patterns in transformer-based machine translation, 2020.

P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya et al., Stand-alone self-attention in vision models, Advances in Neural Information Processing Systems (NeurIPS), 2019.

A. Rives, S. Goyal, J. Meier, D. Guo, M. Ott et al., Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, bioRxiv 622803, 2019.

Y. Rubner, C. Tomasi, and L. J. Guibas, The earth mover's distance as a metric for image retrieval, International Journal of Computer Vision, vol.40, pp.99-121, 2000.

B. Schölkopf and A. J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond, 2001.

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning et al., Recursive deep models for semantic compositionality over a sentiment treebank, Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.

M. Togninalli, E. Ghisu, F. Llinares-López, B. Rieck, and K. Borgwardt, Wasserstein Weisfeiler-Lehman graph kernels, Advances in Neural Information Processing Systems (NeurIPS), 2019.

G. Tolias, Y. Avrithis, and H. Jégou, To aggregate or not to aggregate: Selective match kernels for image search, Proceedings of the International Conference on Computer Vision (ICCV), 2013.
URL : https://hal.archives-ouvertes.fr/hal-00864684

Y. H. Tsai, S. Bai, M. Yamada, L. Morency, and R. Salakhutdinov, Transformer dissection: A unified understanding of transformer's attention via the lens of kernel, Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., Attention is all you need, Advances in Neural Information Processing Systems (NeurIPS), 2017.

E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2019.

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy et al., GLUE: a multi-task benchmark and analysis platform for natural language understanding, International Conference on Learning Representations (ICLR), 2019.

W. Wang, D. Slepcev, S. Basu, J. A. Ozolek, and G. K. Rohde, A linear optimal transportation framework for quantifying and visualizing variations in sets of images, International Journal of Computer Vision, vol.101, issue.2, pp.254-269, 2013.

X. Wang, R. B. Girshick, A. Gupta, and K. He, Non-local neural networks, Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

W. You, S. Sun, and M. Iyyer, Hard-coded Gaussian attention for neural machine translation, Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2020.

T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue et al., HuggingFace's Transformers: State-of-the-art natural language processing, 2019.

J. Zhou and O. G. Troyanskaya, Predicting effects of noncoding variants with deep learning-based sequence model, Nature methods, vol.12, issue.10, pp.931-934, 2015.

A. Gardner, C. A. Duncan, J. Kanno, and R. Selmic, On the definiteness of earth mover's distance and its relation to set intersection, IEEE Transactions on Cybernetics, vol.48, issue.11, pp.3184-3196, 2018.