B. D. O. Anderson and J. B. Moore, Optimal Filtering, 1979.

S. Amari, Natural Gradient Works Efficiently in Learning, Neural Computation, vol.10, issue.2, pp.251-276, 1998.

B. Balle and O.-A. Maillard, Spectral Learning from a Single Trajectory under Finite-State Policies, Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, pp.361-370, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01590940

L. Bottou, Large-scale machine learning with stochastic gradient descent, Proceedings of COMPSTAT'2010, pp.177-186, 2010.

L. Bottou, On-line Learning and Stochastic Approximations, in On-Line Learning in Neural Networks, pp.9-42, 1999.
DOI : 10.1017/CBO9780511569920.003

URL : http://leon.bottou.org/publications/pdf/online-1998.pdf

Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, vol.5, issue.2, pp.157-166, 1994.
DOI : 10.1109/72.279181

URL : http://www.research.microsoft.com/~patrice/PDF/long_term.pdf

[. Cappé, E. Moulines, and T. Ryden, Inference in Hidden Markov Models, 2005.

Y. Dauphin et al., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, Advances in Neural Information Processing Systems, pp.2933-2941, 2014.

A. Defazio, F. Bach, and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in Neural Information Processing Systems, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01016843

B. Delyon, General results on the convergence of stochastic algorithms, IEEE Transactions on Automatic Control, vol.41, issue.9, pp.1245-1255, 1996.
DOI : 10.1109/9.536495

A. Dieuleveut, N. Flammarion, and F. Bach, Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01275431

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, The Journal of Machine Learning Research, vol.12, pp.2121-2159, 2011.

B. Delyon, M. Lavielle, and E. Moulines, Convergence of a stochastic approximation version of the EM algorithm, The Annals of Statistics, vol.27, pp.94-128, 1999.

J.-C. Fort and G. Pagès, Convergence of Stochastic Algorithms: From the Kushner-Clark Theorem to the Lyapunov Functional Method, Advances in Applied Probability, vol.28, pp.1072-1094, 1996.

K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, vol.2, issue.5, pp.359-366, 1989.
DOI : 10.1016/0893-6080(89)90020-8

H. Jaeger, A tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the "echo state network" approach. Tech. rep. 159, German National Research Center for Information Technology, 2002.

R. Johnson and T. Zhang, Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, Advances in Neural Information Processing Systems 26, 2013.

H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, vol.26, Applied Mathematical Sciences, 1978.

H. J. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, 2003.

H. J. Kushner and J. Yang, Analysis of adaptive step-size SA algorithms for parameter tracking, IEEE Transactions on Automatic Control, vol.40, pp.1403-1410, 1995.

L. Ljung, Analysis of recursive stochastic algorithms, IEEE Transactions on Automatic Control, vol.22, issue.4, pp.551-575, 1977.
DOI : 10.1109/TAC.1977.1101561

E. Löcherbach, Ergodicity and speed of convergence to equilibrium for diffusion processes. Lecture notes available on the author's web page.

A. R. Mahmood, Automatic step-size adaptation in incremental supervised learning. Master's thesis, 2010.

S. Mallat, Group Invariant Scattering, Communications on Pure and Applied Mathematics, vol.65, issue.10, pp.1331-1398, 2012.

URL : http://arxiv.org/pdf/1101.2286

D. Maclaurin, D. Duvenaud, and R. Adams, Gradient-based Hyperparameter Optimization through Reversible Learning, Proceedings of The 32nd International Conference on Machine Learning, 2015.

P. Massé and Y. Ollivier, Speed learning on the fly, preprint, 2015.

Y. Ollivier, G. Charpiat, and C. Tallec, Training recurrent networks online without backtracking, preprint, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01228954

Y. Ollivier, Riemannian metrics for neural networks I: feedforward networks, Information and Inference, vol.4, issue.2, pp.108-153, 2015.

URL : https://hal.archives-ouvertes.fr/hal-00857982

Y. Ollivier, Riemannian metrics for neural networks II: recurrent networks and learning symbolic data sequences, Information and Inference, vol.4, issue.2, pp.153-193, 2015.

URL : https://hal.archives-ouvertes.fr/hal-00857980

Y. Ollivier, Online natural gradient as a Kalman filter, preprint, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01660622

B. A. Pearlmutter, Gradient calculations for dynamic recurrent neural networks: a survey, IEEE Transactions on Neural Networks, vol.6, pp.1212-1228, 1995.

H. Robbins and S. Monro, A Stochastic Approximation Method, The Annals of Mathematical Statistics, vol.22, issue.3, pp.400-407, 1951.

L. Sagun et al., Explorations on high dimensional landscapes. Workshop paper accepted at ICLR 2015, available on arXiv.

M. Schmidt, N. Le Roux, and F. Bach, Minimizing finite sums with the Stochastic Average Gradient, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00860051

T. Schaul, S. Zhang, and Y. LeCun, No More Pesky Learning Rates, Proceedings of The 30th International Conference on Machine Learning, ed. by Sanjoy Dasgupta and David McAllester, JMLR, pp.343-351, 2013.

C. Tallec and Y. Ollivier, Unbiased Online Recurrent Optimization, preprint, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01666483

C. Tallec and Y. Ollivier, Unbiasing Truncated Backpropagation Through Time, preprint, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01660627

Y. LeCun, "L'apprentissage profond : une révolution en intelligence artificielle" (Deep learning: a revolution in artificial intelligence). Inaugural lecture at the Collège de France, available online.

O. Solon, Oh the humanity! Poker computer trounces humans in big step for AI, 2017.

Table of contents (excerpt):

5.1.1 Losses on the state-parameter pair
Optimality criterion; time rescaling: construction of the intervals for convergence
6.6 Discussion of the optimality assumptions
6.6.1 Optimality condition on the sum of the gradients
6.6.2 Case of linear regression with Gaussian noise
9.4 Intermediate, open-loop trajectories
9.4.1 Definition of the intermediate trajectories
9.5 Contractivity in the parameter
9.5.1 Intermediate trajectory started from the stable initial quantities
9.5.2 Control horizon for the largest eigenvalue along the stable trajectory
10.2 Establishing the conditions for convergence
10.2.1 Establishing the condition homogeneous in the sequence of step sizes
11.3 Optimality criterion
11.3.1 Modifications of the assumptions to obtain optimality
11.3.2 Time rescaling: construction of the intervals for convergence
13.1 Trajectories of the vectors specific to "NoBackTrack"
13.1.1 Reduction operator on the vectors specific to "NoBackTrack"
The "NoBackTrack" algorithm and "RTRL"
14.4 Application of the central property to the "NoBackTrack" algorithm
14.4.1 Application of the central property to the "NoBackTrack" trajectory
15.2 Establishing the conditions for convergence for "NoBackTrack"
15.2.1 Establishing the conditions non-homogeneous in the sequence of step sizes for "NoBackTrack"
17.3 Experiments
17.3.1 Presentation of the experiments
17.3.2 Description and analysis of the results
No More Pesky Learning Rates
.1 LLR applied to the Stochastic Variance Reduced Gradient
.2 LLR applied to a general stochastic gradient algorithm