| HAL: hal-00579607, version 3 |
| arXiv: 1107.4506 |
| Available versions: | v1 (2011-03-25) | v2 (2011-07-22) | v3 (2011-07-25) |
|
| Robustness of Anytime Bandit Policies |
|
|
| Antoine Salomon 1, 2, Jean-Yves Audibert 1, 2 |
|
|
| (2011-03-24) |
|
|
| This paper studies the deviations of the regret in a stochastic multi-armed bandit problem. When the total number of plays n is known beforehand by the agent, Audibert et al. (2009) exhibit a policy such that with probability at least 1-1/n, the regret of the policy is of order log(n). They have also shown that such a property is not shared by the popular ucb1 policy of Auer et al. (2002). This work first answers an open question: it extends this negative result to any anytime policy. The second contribution of this paper is to design anytime robust policies for specific multi-armed bandit problems in which some restrictions are put on the set of possible distributions of the different arms. |
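For readers unfamiliar with the ucb1 policy mentioned in the abstract, the following is a minimal sketch of it on Bernoulli arms, together with the regret it incurs. It is an illustration under stated assumptions, not code from the paper: the function name `ucb1`, the Bernoulli reward model, the `seed` parameter, and the use of the pseudo-regret (the sum over arms of the gap to the best mean times the number of pulls) are all choices made here. The exploration bonus sqrt(2 ln t / T_k) follows the usual statement of the index in Auer et al. (2002).

```python
import math
import random


def ucb1(means, n, seed=0):
    """Run a UCB1-style policy for n rounds on Bernoulli arms.

    `means` holds the (hypothetical) true success probabilities of the arms.
    Returns (pull counts per arm, pseudo-regret). Assumes n >= len(means).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # T_k: number of times each arm was played
    sums = [0.0] * k   # cumulative reward collected from each arm

    for t in range(1, n + 1):
        if t <= k:
            # Initialisation: play every arm once.
            arm = t - 1
        else:
            # Play the arm with the largest upper confidence bound.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    # Pseudo-regret: gap to the best arm, weighted by how often each arm was played.
    best = max(means)
    regret = sum((best - m) * c for m, c in zip(means, counts))
    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.5, 0.6], n=1000)
    print(counts, regret)
```

Note that this sketch never uses the horizon n inside the decision rule, which is what makes it an anytime policy in the sense of the abstract. Repeating the run over many seeds and inspecting the upper tail of `regret` is one simple way to look at the deviations the paper studies.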
|
| 1: | IMAGINE |
| CSTB – Ecole des Ponts ParisTech – Université Paris-Est Créteil Val-de-Marne (UPEC) | |
| 2: | Laboratoire d'Informatique Gaspard-Monge (LIGM) |
| Université Paris-Est Marne-la-Vallée (UPEMLV) – ESIEE – Ecole des Ponts ParisTech – Fédération de Recherche Bézout – CNRS : UMR8049 | |
|
| Subject: | Statistics/Other Statistics |
|
|
| Exploration-exploitation tradeoff – Multi-armed bandits – Risk analysis / deviations |
|
|
| http://hal.archives-ouvertes.fr/hal-00579607 |
| oai:hal.archives-ouvertes.fr:hal-00579607 |
| From: Antoine Salomon |
| Submitted on: Monday, 25 July 2011 14:04:45 |
| Updated on: Thursday, 21 March 2013 17:26:47 |