R. Akrour, M. Schoenauer, and M. Sebag, APRIL: Active Preference Learning-Based Reinforcement Learning, Proceedings of ECML PKDD 2012, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pp.116-131, 2012.
DOI : 10.1007/978-3-642-33486-3_8
URL : https://hal.archives-ouvertes.fr/hal-00722744

J. Audibert, R. Munos, and C. Szepesvári, Tuning Bandit Algorithms in Stochastic Environments, Proceedings of the Algorithmic Learning Theory, pp.150-165, 2007.
DOI : 10.1007/978-3-540-75225-7_15
URL : https://hal.archives-ouvertes.fr/inria-00203487

P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, vol.47, issue.2/3, pp.235-256, 2002.
DOI : 10.1023/A:1013689704352

H. Beyer and H. Schwefel, Evolution strategies - a comprehensive introduction, Natural Computing, vol.1, issue.1, pp.3-52, 2002.
DOI : 10.1023/A:1015059928466

W. Cheng, J. Fürnkranz, E. Hüllermeier, and S. Park, Preference-Based Policy Iteration: Leveraging Preference Learning for Reinforcement Learning, Proceedings of ECML PKDD 2011, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pp.414-429, 2011.
DOI : 10.1007/978-3-642-23780-5_30
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.224.8007

C. Coello, G. Lamont, and D. Van Veldhuizen, Evolutionary algorithms for solving multi-objective problems, 2007.
DOI : 10.1007/978-1-4757-5184-0

E. Even-Dar, S. Mannor, and Y. Mansour, PAC Bounds for Multi-armed Bandit and Markov Decision Processes, Proceedings of the 15th Annual Conference on Computational Learning Theory, pp.255-270, 2002.
DOI : 10.1007/3-540-45435-7_18

P. Fishburn, Nontransitive measurable utility, Journal of Mathematical Psychology, vol.26, issue.1, pp.31-67, 1982.
DOI : 10.1016/0022-2496(82)90034-7

J. Fürnkranz, E. Hüllermeier, W. Cheng, and S. Park, Preference-based reinforcement learning: a formal framework and a policy iteration algorithm, Machine Learning, vol.89, issue.1-2, pp.123-156, 2012.
DOI : 10.1007/s10994-012-5313-8

N. Hansen and S. Kern, Evaluating the CMA Evolution Strategy on Multimodal Test Functions, Parallel Problem Solving from Nature-PPSN VIII, pp.282-291, 2004.
DOI : 10.1007/978-3-540-30217-9_29

V. Heidrich-Meisner and C. Igel, Variable metric reinforcement learning methods applied to the noisy mountain car problem, Recent Advances in Reinforcement Learning, pp.136-150, 2008.

V. Heidrich-Meisner and C. Igel, Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp.401-408, 2009.
DOI : 10.1145/1553374.1553426

J. Hemelrijk, Note on Wilcoxon's Two-Sample Test when Ties are Present, The Annals of Mathematical Statistics, vol.23, issue.1, pp.133-135, 1952.
DOI : 10.1214/aoms/1177729491

W. Hoeffding, Probability Inequalities for Sums of Bounded Random Variables, Journal of the American Statistical Association, vol.58, issue.301, pp.13-30, 1963.
DOI : 10.1080/01621459.1963.10500830

S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone, PAC subset selection in stochastic multi-armed bandits, Proceedings of the Twenty-ninth International Conference on Machine Learning, pp.655-662, 2012.

M. Lagoudakis and R. Parr, Reinforcement learning as classification: Leveraging modern classifiers, Proceedings of the 20th International Conference on Machine Learning, pp.424-431, 2003.

A. Lazaric, M. Ghavamzadeh, and R. Munos, Analysis of a classification-based policy iteration algorithm, Proceedings of the 27th International Conference on Machine Learning, pp.607-614, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00482065

O. Maron and A. Moore, Hoeffding races: accelerating model selection search for classification and function approximation, Advances in Neural Information Processing Systems, pp.59-66, 1994.

O. Maron and A. Moore, The Racing Algorithm: Model Selection for Lazy Learners, Artificial Intelligence Review, vol.11, issue.1-5, pp.193-225, 1997.
DOI : 10.1007/978-94-017-2053-3_8

V. Mnih, C. Szepesvári, and J. Audibert, Empirical Bernstein stopping, Proceedings of the 25th international conference on Machine learning, ICML '08, pp.672-679, 2008.
DOI : 10.1145/1390156.1390241
URL : https://hal.archives-ouvertes.fr/hal-00834983

H. Moulin, Axioms of cooperative decision making, 1988.
DOI : 10.1017/CCOL0521360552

M. Puterman, Markov decision processes: discrete stochastic dynamic programming, 1994.
DOI : 10.1002/9780470316887

G. A. Rummery and M. Niranjan, On-line Q-learning using connectionist systems, 1994.

R. Serfling, Approximation theorems of mathematical statistics, 1980.

C. Szepesvári, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, vol.4, issue.1, 2010.
DOI : 10.2200/S00268ED1V01Y201005AIM009

T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. J. Weinberger, Inequalities for the l1 deviation of the empirical distribution, 2003.

R. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, vol.8, issue.3, pp.229-256, 1992.

Y. Yue, J. Broder, R. Kleinberg, and T. Joachims, The K-armed dueling bandits problem, Journal of Computer and System Sciences, vol.78, issue.5, pp.1538-1556, 2012.
DOI : 10.1016/j.jcss.2011.12.028

Y. Zhao, M. Kosorok, and D. Zeng, Reinforcement learning design for cancer clinical trials, Statistics in Medicine, vol.28, issue.26, pp.3294-3315, 2009.
DOI : 10.1002/sim.3720