
On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems

Abstract: Multi-armed bandit problems are considered a paradigm of the trade-off between exploring the environment to find profitable actions and exploiting what is already known. In the stationary case, where the distributions of the rewards do not change in time, Upper-Confidence Bound (UCB) policies have been shown to be rate-optimal. A challenging variant of the multi-armed bandit problem is the non-stationary bandit problem, where the gambler must decide which arm to play while facing the possibility of a changing environment. In this paper, we consider the situation where the distributions of rewards remain constant over epochs and change at unknown time instants. We analyze two algorithms: the discounted UCB and the sliding-window UCB. We establish an upper bound for the expected regret of both algorithms by upper-bounding the expectation of the number of times a suboptimal arm is played. For that purpose, we derive a Hoeffding-type inequality for self-normalized deviations with a random number of summands. We also establish a lower bound for the regret in the presence of abrupt changes in the arms' reward distributions, and show that both the discounted UCB and the sliding-window UCB match this lower bound up to a logarithmic factor.
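The sliding-window UCB policy mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' reference implementation: the window size `tau`, the exploration constant `xi`, the reward bound `b`, and the forced-exploration rule for arms unseen in the current window are illustrative assumptions.

```python
import math
from collections import deque

def sliding_window_ucb(arms, horizon, tau=200, xi=0.6, b=1.0):
    """Play a K-armed bandit for `horizon` rounds with a sliding-window
    UCB sketch: estimates use only the last `tau` plays, so the policy
    can track reward distributions that change abruptly.

    arms: list of callables, each returning a reward in [0, b] when played.
    Returns the list of (arm_index, reward) pairs, one per round.
    """
    k = len(arms)
    window = deque(maxlen=tau)  # keeps only the last tau (arm, reward) pairs
    history = []
    for t in range(1, horizon + 1):
        # Empirical counts and reward sums restricted to the window.
        counts = [0] * k
        sums = [0.0] * k
        for i, r in window:
            counts[i] += 1
            sums[i] += r
        if any(c == 0 for c in counts):
            # Forced exploration: play an arm with no sample in the window.
            choice = min(range(k), key=lambda i: counts[i])
        else:
            def ucb(i):
                mean = sums[i] / counts[i]
                # Confidence bonus uses log(min(t, tau)) because at most
                # tau observations are ever in the window.
                bonus = b * math.sqrt(xi * math.log(min(t, tau)) / counts[i])
                return mean + bonus
            choice = max(range(k), key=ucb)
        reward = arms[choice]()
        window.append((choice, reward))
        history.append((choice, reward))
    return history
```

With two arms of constant rewards 0.2 and 0.8, for example, `sliding_window_ucb([lambda: 0.2, lambda: 0.8], horizon=2000)` plays the better arm in the vast majority of rounds while periodically re-sampling the other, which is what lets the policy react when the distributions switch.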
Document type: Preprint / Working Paper
Cited literature: 17 references
Contributor: Aurélien Garivier
Submitted on: Thursday, May 22, 2008 - 11:07:53 AM
Last modification on: Friday, July 31, 2020 - 10:44:06 AM
Document(s) archived on: Friday, May 28, 2010 - 7:50:15 PM
  • HAL Id: hal-00281392, version 1
  • arXiv: 0805.3415



Aurélien Garivier, Eric Moulines. On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems. 2008. ⟨hal-00281392⟩
