M. Araya, V. Thomas, and O. Buffet, Near-optimal BRL using optimistic local transitions, International Conference on Machine Learning (ICML), 2012.
URL : https://hal.archives-ouvertes.fr/hal-00755270

J. Asmuth, L. Li, M. L. Littman, A. Nouri, and D. Wingate, A Bayesian sampling approach to exploration in reinforcement learning, Uncertainty in Artificial Intelligence (UAI), pp.19-26, 2009.

J. Asmuth and M. L. Littman, Approaching Bayes-optimality using Monte-Carlo tree search, International Conference on Automated Planning and Scheduling (ICAPS), 2011.

P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, pp.235-256, 2002.

R. Bellman, Dynamic Programming, 1957.

R. I. Brafman and M. Tennenholtz, R-max - a general polynomial time algorithm for near-optimal reinforcement learning, Journal of Machine Learning Research, vol.3, pp.213-231, 2003.

S. Bubeck and R. Munos, Open loop optimistic planning, Conference on Learning Theory (COLT), pp.477-489, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00943119

S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári, Online optimization in X-armed bandits, Neural Information Processing Systems (NIPS), pp.201-208, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00329797

L. Busoniu and R. Munos, Optimistic planning for Markov decision processes, International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 22, pp.182-189, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00756736

L. Busoniu, R. Munos, B. De Schutter, and R. Babuska, Optimistic planning for sparsely stochastic systems, 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp.48-55, 2011.
DOI : 10.1109/ADPRL.2011.5967375
URL : https://hal.archives-ouvertes.fr/hal-00830125

P. Castro and D. Precup, Smarter Sampling in Model-Based Bayesian Reinforcement Learning, Machine Learning and Knowledge Discovery in Databases, pp.200-214, 2010.
DOI : 10.1007/978-3-642-15880-3_19

M. Castronovo, F. Maes, R. Fonteneau, and D. Ernst, Learning exploration/exploitation strategies for single trajectory reinforcement learning, European Workshop on Reinforcement Learning (EWRL), 2012.

R. Coulom, Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search, Computers and Games, pp.72-83, 2007.
DOI : 10.1007/978-3-540-75538-8_7
URL : https://hal.archives-ouvertes.fr/inria-00116992

R. Dearden, N. Friedman, and S. Russell, Bayesian Q-learning, National Conference on Artificial Intelligence (AAAI), pp.761-768, 1998.

C. Dimitrakakis, Tree Exploration for Bayesian RL Exploration, 2008 International Conference on Computational Intelligence for Modelling Control & Automation, pp.1029-1034, 2008.
DOI : 10.1109/CIMCA.2008.32
URL : http://arxiv.org/abs/0902.0392

C. Dimitrakakis and M. G. Lagoudakis, Rollout sampling approximate policy iteration, Machine Learning, pp.157-171, 2008.
DOI : 10.1007/978-3-540-87479-9_6
URL : http://arxiv.org/abs/0805.2027

M. O. Duff, Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes, 2002.

A. A. Feldbaum, Dual control theory, Automation and Remote Control, pp.874-1039, 1960.

S. Gelly, Y. Wang, R. Munos, and O. Teytaud, Modification of UCT with patterns in Monte-Carlo Go, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00117266

J. C. Gittins, Multiarmed Bandit Allocation Indices, 1989.
DOI : 10.1002/9780470980033

A. Guez, D. Silver, and P. Dayan, Efficient Bayes-adaptive reinforcement learning using sample-based search, Neural Information Processing Systems (NIPS), 2012.

J. F. Hren and R. Munos, Optimistic Planning of Deterministic Systems, Recent Advances in Reinforcement Learning, pp.151-164, 2008.
DOI : 10.1007/978-3-540-89722-4_12
URL : https://hal.archives-ouvertes.fr/hal-00830182

J. E. Ingersoll, Theory of Financial Decision Making, 1987.

T. Jaksch, R. Ortner, and P. Auer, Near-optimal regret bounds for reinforcement learning, Journal of Machine Learning Research, vol.11, pp.1563-1600, 2010.

L. Kocsis and C. Szepesvári, Bandit Based Monte-Carlo Planning, Machine Learning: ECML 2006, pp.282-293, 2006.
DOI : 10.1007/11871842_29

J. Z. Kolter and A. Y. Ng, Near-Bayesian exploration in polynomial time, International Conference on Machine Learning (ICML), pp.513-520, 2009.
DOI : 10.1145/1553374.1553441

R. Munos, Optimistic optimization of a deterministic function without the knowledge of its smoothness, Neural Information Processing Systems (NIPS), 2011.
URL : https://hal.archives-ouvertes.fr/hal-00830143

R. Munos, The optimistic principle applied to games, optimization and planning: Towards Foundations of Monte-Carlo Tree Search, 2012.

S. A. Murphy, Optimal dynamic treatment regimes, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol.65, issue.2, pp.331-366, 2003.

R. Ortner and P. Auer, Logarithmic online regret bounds for undiscounted reinforcement learning, Neural Information Processing Systems (NIPS), 2007.

J. Peters, S. Vijayakumar, and S. Schaal, Reinforcement learning for humanoid robotics, IEEE-RAS International Conference on Humanoid Robots, pp.1-20, 2003.

P. Poupart, N. Vlassis, J. Hoey, and K. Regan, An analytic solution to discrete Bayesian reinforcement learning, International Conference on Machine Learning (ICML), pp.697-704, 2006.
DOI : 10.1145/1143844.1143932

M. Riedmiller, Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method, European Conference on Machine Learning (ECML), pp.317-328, 2005.
DOI : 10.1007/11564096_32

D. Silver and J. Veness, Monte-Carlo planning in large POMDPs, Neural Information Processing Systems (NIPS), 2010.

J. Sorg, S. Singh, and R. L. Lewis, Variance-based rewards for approximate Bayesian reinforcement learning, Uncertainty in Artificial Intelligence (UAI), 2010.

M. Strens, A Bayesian framework for reinforcement learning, International Conference on Machine Learning (ICML), pp.943-950, 2000.

R. S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning, pp.9-44, 1988.
DOI : 10.1007/BF00115009

T. J. Walsh, S. Goschin, and M. L. Littman, Integrating sample-based planning and model-based reinforcement learning, AAAI Conference on Artificial Intelligence (AAAI), 2010.

A. Weinstein and M. L. Littman, Bandit-based planning and learning in continuous-action Markov decision processes, International Conference on Automated Planning and Scheduling (ICAPS), 2012.