R. Agrawal, Sample mean based index policies with O(log n) regret for the multi-armed bandit problem, Advances in Applied Probability, vol. 27, no. 4, pp. 1054-1078, 1995.

J.-Y. Audibert, R. Munos, and C. Szepesvári, Exploration-exploitation trade-off using variance estimates in multi-armed bandits, Theoretical Computer Science, vol. 410, no. 19, pp. 1876-1902, 2009.

J.-Y. Audibert and S. Bubeck, Regret bounds and minimax policies under partial monitoring, Journal of Machine Learning Research, vol. 11, pp. 2635-2686, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00654356

P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, vol. 47, no. 2-3, pp. 235-256, 2002.
DOI : 10.1023/A:1013689704352

L. M. Bregman, The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR Computational Mathematics and Mathematical Physics, vol. 7, no. 3, pp. 200-217, 1967.

S. Bubeck and N. Cesa-Bianchi, Regret analysis of stochastic and nonstochastic multi-armed bandit problems, Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1-122, 2012.
DOI : 10.1561/2200000024

A. N. Burnetas and M. N. Katehakis, Optimal adaptive policies for Markov decision processes, Mathematics of Operations Research, vol. 22, no. 1, pp. 222-255, 1997.
DOI : 10.1287/moor.22.1.222

O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz, Kullback-Leibler upper confidence bounds for optimal sequential allocation, The Annals of Statistics, vol. 41, no. 3, pp. 1516-1541, 2013.
DOI : 10.1214/13-AOS1119

Y. S. Chow and H. Teicher, Probability Theory: Independence, Interchangeability, Martingales, 2nd ed., Springer, 1988.

I. H. Dinwoodie, Mesures dominantes et théorème de Sanov, Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, pp. 365-373, 1992.

A. Garivier, P. Ménard, and G. Stoltz, Explore first, exploit next: The true shape of regret in bandit problems, preprint, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01276324

J. C. Gittins, Bandit processes and dynamic allocation indices, Journal of the Royal Statistical Society, Series B, vol. 41, no. 2, pp. 148-177, 1979.

J. Honda and A. Takemura, An asymptotically optimal bandit algorithm for bounded support models, in Proceedings of the Conference on Learning Theory (COLT), 2010.

T. L. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics, vol. 6, no. 1, pp. 4-22, 1985.
DOI : 10.1016/0196-8858(85)90002-8

T. L. Lai, Adaptive treatment allocation and the multi-armed bandit problem, The Annals of Statistics, vol. 15, no. 3, pp. 1091-1114, 1987.

T. L. Lai, Boundary crossing problems for sample means, The Annals of Probability, vol. 16, pp. 375-396, 1988.

O.-A. Maillard, R. Munos, and G. Stoltz, A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences, in Proceedings of the Conference on Learning Theory (COLT), 2011.
URL : https://hal.archives-ouvertes.fr/inria-00574987

H. Robbins, Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society, vol. 58, no. 5, pp. 527-535, 1952.
DOI : 10.1090/S0002-9904-1952-09620-8

T. L. Lai and D. Siegmund, editors, Herbert Robbins Selected Papers, Springer, 2012.

W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, vol. 25, no. 3-4, pp. 285-294, 1933.

W. R. Thompson, On a criterion for the rejection of observations and the distribution of the ratio of deviation to sample standard deviation, The Annals of Mathematical Statistics, vol. 6, no. 4, pp. 214-219, 1935.

A. Wald, Sequential tests of statistical hypotheses, The Annals of Mathematical Statistics, vol. 16, no. 2, pp. 117-186, 1945.
DOI : 10.1214/aoms/1177731118