Y. Abbasi, P. Bartlett, V. Kanade et al., Online learning in Markov decision processes with adversarially chosen transition probability distributions, Advances in Neural Information Processing Systems 26, pp. 2508-2516, 2013.

P. Auer, R. Ortner, and C. Szepesvári, Improved rates for the stochastic continuum-armed bandit problem, COLT, pp. 454-468, 2007.
DOI: 10.1007/978-3-540-72927-3_33

M. Gheshlaghi Azar, A. Lazaric, and E. Brunskill, Regret bounds for reinforcement learning with policy advice, ECML/PKDD, pp. 97-112, 2013.
DOI: 10.1007/978-3-642-40988-2_7
URL: https://hal.archives-ouvertes.fr/hal-00924021

J. Baxter and P. L. Bartlett, Reinforcement learning in POMDPs via direct gradient ascent, ICML, pp. 41-48, 2000.

S. Bubeck, R. Munos, G. Stoltz et al., X-armed bandits, Journal of Machine Learning Research, vol. 12, pp. 1655-1695, 2011.
URL: https://hal.archives-ouvertes.fr/hal-00450235

S. Bubeck, G. Stoltz, and J. Y. Yu, Lipschitz bandits without the Lipschitz constant, ALT, pp. 144-158, 2011.
DOI: 10.1007/978-3-642-24412-4_14
URL: https://hal.archives-ouvertes.fr/hal-00595692

A. D. Bull, Adaptive-treed bandits, arXiv preprint arXiv:1302, 2013.
DOI: 10.3150/14-bej644

E. Cope, Regret and convergence bounds for a class of continuum-armed bandit problems, IEEE Transactions on Automatic Control, vol. 54, issue 6, pp. 1243-1253, 2009.
DOI: 10.1109/TAC.2009.2019797

J. Djolonga, A. Krause, and V. Cevher, High-dimensional Gaussian process bandits, Neural Information Processing Systems (NIPS), 2013.

T. Jaksch, R. Ortner, and P. Auer, Near-optimal regret bounds for reinforcement learning, Journal of Machine Learning Research, vol. 11, pp. 1563-1600, 2010.

R. Kleinberg, A. Slivkins, and E. Upfal, Multi-armed bandits in metric spaces, Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing (STOC '08), pp. 681-690, 2008.
DOI: 10.1145/1374376.1374475

R. Kleinberg, A. Slivkins, and E. Upfal, Bandits and experts in metric spaces, 2013.

J. Kober and J. Peters, Policy search for motor primitives in robotics, Machine Learning, pp. 171-203, 2011.

T. Lattimore, M. Hutter, and P. Sunehag, The sample-complexity of general reinforcement learning, Proceedings of the Thirtieth International Conference on Machine Learning (ICML), 2013.

D. A. Levin, Y. Peres, and E. L. Wilmer, Markov chains and mixing times, 2006.
DOI: 10.1090/mbk/058

A. Maurer and M. Pontil, Empirical Bernstein bounds and sample variance penalization, arXiv preprint, 2009.

R. Munos, Optimistic optimization of a deterministic function without the knowledge of its smoothness, NIPS, pp. 783-791, 2011.

R. Munos, From bandits to Monte-Carlo tree search: the optimistic principle applied to optimization and planning, Foundations and Trends in Machine Learning, 2014.
DOI: 10.1561/2200000038
URL: https://hal.archives-ouvertes.fr/hal-00747575

R. Ortner and D. Ryabko, Online regret bounds for undiscounted continuous reinforcement learning, Advances in Neural Information Processing Systems 25, pp. 1772-1780, 2012.
URL: https://hal.archives-ouvertes.fr/hal-00765441

B. Scherrer and M. Geist, Policy search: any local optimum enjoys a global performance guarantee, arXiv preprint, 2013.
URL: https://hal.archives-ouvertes.fr/hal-00829548

A. Slivkins, Contextual bandits with similarity information, CoRR, abs/0907, 2009.

A. Slivkins, Multi-armed bandits on implicit metric spaces, Advances in Neural Information Processing Systems, pp. 1602-1610, 2011.