J. Abernethy, E. Hazan, and A. Rakhlin, Competing in the dark: An efficient algorithm for bandit linear optimization, Proceedings of the 21st Annual Conference on Learning Theory (COLT), pp.263-274, 2008.

J. Y. Audibert, S. Bubeck, and G. Lugosi, Regret in online combinatorial optimization, Mathematics of Operations Research, 2014.

G. Bartók, D. Pál, C. Szepesvári, and I. Szita, Online learning. Lecture notes, 2011.

S. Boyd and L. Vandenberghe, Convex Optimization, 2004.

C. Daniel, G. Neumann, and J. Peters, Hierarchical relative entropy policy search, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, pp.273-281, 2012.

O. Dekel and E. Hazan, Better rates for any adversarial deterministic MDP, Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp.675-683, 2013.

E. Even-Dar, S. M. Kakade, and Y. Mansour, Experts in a Markov decision process, NIPS-17, pp.401-408, 2005.

E. Even-Dar, S. M. Kakade, and Y. Mansour, Online Markov Decision Processes, Mathematics of Operations Research, vol.34, issue.3, pp.726-736, 2009.
DOI : 10.1287/moor.1090.0396

A. György, T. Linder, G. Lugosi, and G. Ottucsák, The on-line shortest path problem under partial monitoring, Journal of Machine Learning Research, vol.8, pp.2369-2403, 2007.

S. Kakade, A natural policy gradient, Advances in Neural Information Processing Systems 14 (NIPS), pp.1531-1538, 2001.

W. M. Koolen, M. K. Warmuth, and J. Kivinen, Hedging structured concepts, Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pp.93-105, 2010.

B. Martinet, Régularisation d'inéquations variationnelles par approximations successives, ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique, vol.4, issue.R3, pp.154-158, 1970.

G. Neu, A. György, and C. Szepesvári, The online loop-free stochastic shortest path problem, Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pp.231-243, 2010.

G. Neu, A. György, and C. Szepesvári, The adversarial stochastic shortest path problem with unknown transition probabilities, AISTATS 2012, pp.805-813, 2012.

G. Neu, A. György, C. Szepesvári, and A. Antos, Online Markov Decision Processes Under Bandit Feedback, NIPS-23, pp.1804-1812, 2010.
DOI : 10.1109/TAC.2013.2292137
URL : https://hal.archives-ouvertes.fr/hal-01079422

J. Peters, K. Mülling, and Y. Altun, Relative entropy policy search, AAAI 2010, pp.1607-1612, 2010.

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
DOI : 10.1002/9780470316887

A. Rakhlin, Lecture notes on online learning, 2009.

R. T. Rockafellar, Monotone Operators and the Proximal Point Algorithm, SIAM Journal on Control and Optimization, vol.14, issue.5, pp.877-898, 1976.
DOI : 10.1137/0314056

R. Sutton and A. Barto, Reinforcement Learning: An Introduction, 1998.

C. Szepesvári, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, vol.4, issue.1, 2010.
DOI : 10.2200/S00268ED1V01Y201005AIM009

J. Y. Yu, S. Mannor, and N. Shimkin, Markov Decision Processes with Arbitrary Reward Processes, Mathematics of Operations Research, vol.34, issue.3, pp.737-757, 2009.
DOI : 10.1287/moor.1090.0397

M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, Proceedings of the Twentieth International Conference on Machine Learning, pp.928-936, 2003.