P. Auer-andnicoì-o-cesa-bianchi, Online learning with malicious noise and the closure algorithm Finite-time analysis of the multiarmed bandit problem, ACBF02] Peter Auer, pp.83-99235, 1998.

. Peter-auer, Y. Nicoì-o-cesa-bianchi, R. E. Freund, and . Schapire, The nonstochastic multiarmed bandit problem

S. J. Comput, Thompson sampling for contextual bandits with linear payoffs. CoRR, 2012. [BL05] Léon Bottou and Yann LeCun. On-line learning for very large datasets, BMSS08] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári, pp.48-77137, 2002.

[. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal, Mortal multiarmed bandits, NIPS, pp.273-280, 2008.

[. Chu, L. Li, L. Reyzin, and R. E. Schapire, Contextual bandits with linear payoff functions, JMLR Proceedings, pp.208-214, 2011.

D. Dudík, S. Hsu, N. Kale, J. Karampatziakis, L. Langford et al., Efficient optimal learning for contextual bandits, 1106.

T. Feraud and . Urvoy, A stochastic bandit algorithm for scratch games

L. Wray and . Buntine, JMLR.org, 2012. [FU13] Raphaël Feraud and Tanguy Urvoy. Exploration and exploitation of scratch games, JMLR Proceedings Machine Learning, pp.129-143377, 2013.

[. Gaudel and M. Sebag, Feature selection as a one-player game, Omnipress, pp.359-366, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00484049

[. Kaufmann, N. Korda, and R. Munos, Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis, Algorithmic Learning Theory, Proc. of the 23rd International Conference (ALT), volume LNCS 7568, pp.199-213, 2012.
DOI : 10.1007/978-3-642-34106-9_18

URL : https://hal.archives-ouvertes.fr/hal-00830033

D. Robert, A. Kleinberg, Y. Niculescu-mizil, and . Sharma, Regret bounds for sleeping experts and bandits, COLT, pp.425-436, 2008.

L. Kocsis and C. Szepesvári, Bandit Based Monte-Carlo Planning, KSST08] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari Proceedings of the 25th International Conference on Machine Learning, ICML '08, pp.282-293, 2006.
DOI : 10.1007/11871842_29

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.102.1296

[. Li, W. Chu, J. Langford, R. E. Schapire-[-lr85-]-t, H. Lai et al., Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics, vol.6, issue.1, pp.4-22, 1985.
DOI : 10.1016/0196-8858(85)90002-8

K. Volodymyr-mnih, D. Kavukcuoglu, A. Silver, and . Graves, Playing atari with deep reinforcement learning, 2013.

G. [. Rumelhart, R. J. Hinton, and . Williams, Parallel distributed processing : Explorations in the microstructure of cognition chapter Learning Internal Representations by Error Propagation [Ros58] Frank Rosenblatt. The perceptron : A probabilistic model for information storage and organization in the brain, Psychological Review, vol.1, issue.6, pp.318-362, 1958.

P. Seldin, F. Auer, J. Laviolette, R. Shawe-taylor, and . Ortner, Pac-bayesian analysis of contextual bandits, NIPS, pp.1683-1691, 2011.

G. Tesauro, Programming backgammon using self-teaching neural nets, Artificial Intelligence, vol.134, issue.1-2, pp.181-199, 2002.
DOI : 10.1016/S0004-3702(01)00110-2

URL : http://doi.org/10.1016/s0004-3702(01)00110-2

]. W. Tho33 and . Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, vol.25, pp.285-294, 1933.