Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol.521, issue.7553, pp.436-444, 2015.
DOI : 10.1038/nature14539

Fig.: Comparison for a 784-80-70-60-50-40-30-20-10 network.

Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, vol.5, issue.2, pp.157-166, 1994.
DOI : 10.1109/72.279181

URL : http://www.research.microsoft.com/~patrice/PDF/long_term.pdf

L. Bottou, F. E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, Tech. Rep., 2016.

N. Qian, On the momentum term in gradient descent learning algorithms, Neural Networks, vol.12, issue.1, pp.145-151, 1999.
DOI : 10.1016/S0893-6080(98)00116-6

Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Mathematics Doklady, pp.372-376, 1983.

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res, vol.12, pp.2121-2159, 2011.

M. D. Zeiler, ADADELTA: An adaptive learning rate method, arXiv preprint arXiv:1212.5701, 2012.

T. Tieleman and G. Hinton, Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, 2012.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, Int. Conf. Learn. Representations, 2015.

I. Sutskever, J. Martens, G. Dahl, and G. Hinton, On the importance of initialization and momentum in deep learning, Int. Conf. Mach. Learn., 2013.

G. Hinton and R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, vol.313, issue.5786, pp.504-507, 2006.
DOI : 10.1126/science.1127647

J. Martens, Deep learning via Hessian-free optimization, Int. Conf. Mach. Learn., 2010.

J. Martens and I. Sutskever, Learning recurrent neural networks with Hessian-free optimization, Int. Conf. Mach. Learn., 2011.

O. Vinyals and D. Povey, Krylov subspace descent for deep learning, Int. Conf. Artif. Intell. Statist., 2012.

Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli et al., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, Ann. Conf. Neur. Inform. Proc. Syst., 2014.

J. J. Moré and D. C. Sorensen, Computing a Trust Region Step, SIAM Journal on Scientific and Statistical Computing, vol.4, issue.3, pp.553-572, 1983.
DOI : 10.1137/0904038

Y. Yuan, Recent advances in trust region algorithms, Mathematical Programming, vol.146, issue.6, pp.249-281, 2015.
DOI : 10.1007/s10107-013-0679-3

E. Chouzenoux and J. Pesquet, A Stochastic Majorize-Minimize Subspace Algorithm for Online Penalized Least Squares Estimation, IEEE Transactions on Signal Processing, vol.65, issue.18, 2017.
DOI : 10.1109/TSP.2017.2709265

URL : https://hal.archives-ouvertes.fr/hal-01613204

J. B. Erway and P. E. Gill, A Subspace Minimization Method for the Trust-Region Step, SIAM Journal on Optimization, vol.20, issue.3, pp.1439-1461, 2010.
DOI : 10.1137/08072440X

B. A. Pearlmutter, Fast Exact Multiplication by the Hessian, Neural Computation, vol.6, issue.1, pp.147-160, 1994.

Y. Nesterov and B. T. Polyak, Cubic regularization of Newton method and its global performance, Mathematical Programming, vol.99, issue.1, pp.177-205, 2006.
DOI : 10.1007/s10107-006-0706-8