References

R. Akrour, M. Schoenauer, and M. Sebag. APRIL: Active preference learning-based reinforcement learning. In ECML/PKDD, 2012.

M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions, Part II: Applications. Mathematics of Operations Research, 31(4):673-695, 2006.
DOI : 10.1287/moor.1060.0213

P. Blavatskyy. Axiomatization of a preference for most probable winner. Theory and Decision, 60(1):17-33, 2006.
DOI : 10.1007/s11238-005-4753-z

V. S. Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291-294, 1997.
DOI : 10.1016/S0167-6911(97)90015-3

V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.

G. W. Brown. Iterative solution of games by fictitious play. In Activity Analysis of Production and Allocation, 1951.

V. Chvátal. Matrix games. In Linear Programming, chapter 15, pp. 228-239, 1983.

P. C. Fishburn. SSB utility theory: an economic perspective. Mathematical Social Sciences, 8(1):63-94, 1984.
DOI : 10.1016/0165-4896(84)90061-1

P. C. Fishburn. Nontransitive preferences in decision theory. Journal of Risk and Uncertainty, 4(2):113-134, 1991.
DOI : 10.1007/BF00056121

J. Fürnkranz, E. Hüllermeier, W. Cheng, and S.-H. Park. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine Learning, 89(1-2):123-156, 2012.
DOI : 10.1007/s10994-012-5313-8

H. Gilbert, O. Spanjaard, P. Viappiani, and P. Weng. Solving MDPs with Skew Symmetric Bilinear Utility Functions. In IJCAI, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01212802

H. Gilbert, O. Spanjaard, P. Viappiani, and P. Weng. Reducing the Number of Queries in Interactive Value Iteration. In ADT, pp. 139-152, 2015.
DOI : 10.1007/978-3-319-23114-3_9

URL : https://hal.archives-ouvertes.fr/hal-01213280

J. Hofbauer. Stability for the best response dynamics. Working paper, 1995.

Y. Nakamura. Risk attitudes for nonlinear measurable utility. Annals of Operations Research, 19(1):311-333, 1989.
DOI : 10.1007/BF02283527

F. Perea and J. Puerto. Dynamic programming analysis of the TV game "Who wants to be a millionaire?". European Journal of Operational Research, 183(2):805-811, 2007.
DOI : 10.1016/j.ejor.2006.10.041

R. L. Rivest and E. Shen. An optimal single-winner preferential voting system based on game theory. In Proceedings of the Third International Workshop on Computational Social Choice, 2010.

J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1947.

P. Weng and B. Zanuttini. Interactive value iteration for Markov decision processes with unknown rewards. In International Joint Conference on Artificial Intelligence (IJCAI), 2013.
URL : https://hal.archives-ouvertes.fr/hal-00942290

P. Weng et al. Interactive Q-Learning with Ordinal Rewards and Unreliable Tutor. In ECML/PKDD Workshop on Reinforcement Learning with Generalized Feedback, 2013.

A. Wilson, A. Fern, and P. Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In Advances in Neural Information Processing Systems 25, pp. 1133-1141, 2012.

C. Wirth and J. Fürnkranz. EPMC: every visit preference Monte Carlo for reinforcement learning. In Asian Conference on Machine Learning (ACML), pp. 483-497, 2013.

C. Wirth and G. Neumann. Model-free preference-based reinforcement learning, 2015.