R. Bollapragada, D. Mudigere, J. Nocedal, H. M. Shi, and P. T. Tang, A progressive batching L-BFGS method for machine learning. arXiv preprint, 2018.

C. Brezinski and M. R. Zaglia, Extrapolation methods, Applied Numerical Mathematics, vol. 15, no. 2, 2013.
DOI: 10.1016/0168-9274(94)00015-8
URL: https://hal.archives-ouvertes.fr/hal-00018524

S. Cabay and L. Jackson, A Polynomial Extrapolation Method for Finding Limits and Antilimits of Vector Sequences, SIAM Journal on Numerical Analysis, vol. 13, no. 5, pp. 734-752, 1976.
DOI: 10.1137/0713060

L. Deng, J. Li, J. Huang, K. Yao, D. Yu et al., Recent advances in deep learning for speech research at Microsoft, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8604-8608, 2013.
DOI: 10.1109/ICASSP.2013.6639345
URL: http://research.microsoft.com/pubs/188864/ICASSP-2013-OverviewMSRDeepLearning.pdf

R. Eddy, Extrapolating to the limit of a vector sequence, Information Linkage Between Applied Mathematics and Industry, pp. 387-396, 1979.

G. H. Golub and R. S. Varga, Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods, Numerische Mathematik, vol. 3, no. 1, pp. 157-168, 1961.
DOI: 10.1007/BF01386014

P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski et al., Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint, 2017.

I. Guyon, Design of experiments of the NIPS 2003 variable selection benchmark, 2003.

P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, Averaging weights leads to wider optima and better generalization. arXiv preprint, 2018.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization. arXiv preprint, 2014.

E. Moulines and F. Bach, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, Advances in Neural Information Processing Systems, 2011.
URL: https://hal.archives-ouvertes.fr/hal-00608041

Y. Nesterov, Introductory lectures on convex optimization: A basic course, 2013.
DOI: 10.1007/978-1-4419-8853-9

B. T. Polyak and A. B. Juditsky, Acceleration of Stochastic Approximation by Averaging, SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838-855, 1992.
DOI: 10.1137/0330046

S. J. Reddi, S. Kale, and S. Kumar, On the convergence of Adam and beyond, International Conference on Learning Representations, 2018.

D. Scieur, F. Bach, and A. d'Aspremont, Nonlinear acceleration of stochastic algorithms, Advances in Neural Information Processing Systems, pp. 3985-3994, 2017.
URL: https://hal.archives-ouvertes.fr/hal-01618379

D. Scieur, A. d'Aspremont, and F. Bach, Regularized nonlinear acceleration, Advances in Neural Information Processing Systems, pp. 712-720, 2016.
URL: https://hal.archives-ouvertes.fr/hal-01384682

D. Scieur, E. Oyallon, A. d'Aspremont, and F. Bach, Nonlinear acceleration of CNNs, Workshop track of the International Conference on Learning Representations (ICLR), 2018.
URL: https://hal.archives-ouvertes.fr/hal-01805251

T. Tieleman and G. Hinton, Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, pp. 26-31, 2012.

H. F. Walker and P. Ni, Anderson Acceleration for Fixed-Point Iterations, SIAM Journal on Numerical Analysis, vol. 49, no. 4, pp. 1715-1735, 2011.
DOI: 10.1137/10078356X
URL: http://users.wpi.edu/%7Ewalker/Papers/Walker-Ni%2CSINUM%2CV49%2C1715-1735.pdf

Z. Zhou, J. Wu, and W. Tang, Ensembling neural networks: Many could be better than all, Artificial Intelligence, vol. 137, no. 1-2, pp. 239-263, 2002.
DOI: 10.1016/S0004-3702(02)00190-X
