15863 articles – 31996 references  [version française]
HAL: hal-00579607, version 3

Detailed view  Export this paper
Available versions:
Robustness of Anytime Bandit Policies
Antoine Salomon 1, 2, Jean-Yves Audibert 1, 2
(2011-03-24)

This paper studies the deviations of the regret in a stochastic multi-armed bandit problem. When the total number of plays n is known beforehand by the agent, Audibert et al. (2009) exhibit a policy such that with probability at least 1-1/n, the regret of the policy is of order log(n). They have also shown that such a property is not shared by the popular ucb1 policy of Auer et al. (2002). This work first answers an open question: it extends this negative result to any anytime policy. The second contribution of this paper is to design anytime robust policies for specific multi-armed bandit problems in which some restrictions are put on the set of possible distributions of the different arms.
1:  IMAGINE
CSTB – Ecole des Ponts ParisTech – Université Paris-Est Créteil Val-de-Marne (UPEC)
2:  Laboratoire d'Informatique Gaspard-Monge (LIGM)
Université Paris-Est Marne-la-Vallée (UPEMLV) – ESIEE – Ecole des Ponts ParisTech – Fédération de Recherche Bézout – CNRS : UMR8049
Statistics/Other Statistics
Exploration-exploitation tradeoff – Multi-armed bandits – Risk analysis / deviations
Attached file list to this document: 
PDF
anytime.pdf(843.7 KB)