KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints

We consider $K$-–armed stochastic bandits and consider cumulative regret bounds up to time $T$. We are interested in strategies achieving \emph{simultaneously} a distribution-free regret bound of optimal order $\sqrt{KT}$ and a distribution-dependent regret that is asymptotically optimal, that is, matching the $\kappa\ln T$ lower bound by Lai and Robbins (1985) and Burnetas and Katehakis (1996), where $\kappa$ is the optimal problem-dependent constant. This constant $\kappa$ depends on the model $\mathcal{D}$ considered (the family of possible distributions over the arms). M{\'e}nard and Garivier (2017) provided strategies achieving such a bi-optimality in the parametric case of models given by one-dimensional exponential families, while Lattimore (2016, 2018) did so for the family of (sub)Gaussian distributions with variance less than $1$. We extend this result to the non-parametric case of all distributions over $[0,1]$. We do so by combining the MOSS strategy by Audibert and Bubeck (2009), which enjoys a distribution-free regret bound of optimal order $\sqrt{KT}$, and the KL-UCB strategy by Capp{\'e} et al. (2013), for which we provide in passing the first analysis of an optimal distributiondependent $\kappa\ln T$ regret bound in the model of all distributions over $[0,1]$. We were able to obtain this non-parametric bi-optimality result while working hard to streamline the proofs (of previously known regret bounds and thus of the new analyses carried out); a second merit of the present contribution is therefore to provide a review of proofs of classical regret bounds for index-based strategies for K–armed stochastic bandits.

Mots clés

K-armed stochastic bandits distribution-dependent regret bounds K–armed stochastic bandits distribution-free regret bounds index policies

Domaines

Machine Learning [stat.ML] Probabilités [math.PR] Théorie [stat.TH]

Fichier principal

KL-UCB--GHMS.pdf (5.08 Mo)

Origine : Fichiers produits par l'(les) auteur(s)

Gilles Stoltz : Connectez-vous pour contacter le contributeur

https://hal.science/hal-01785705

Soumis le : mardi 28 juin 2022-17:35:09

Dernière modification le : jeudi 14 mars 2024-03:15:21

Dates et versions

hal-01785705 , version 1 (04-05-2018)

hal-01785705 , version 2 (05-11-2019)

hal-01785705 , version 3 (28-06-2022)

Identifiants

HAL Id : hal-01785705 , version 3

Citer

Aurélien Garivier, Hédi Hadiji, Pierre Ménard, Gilles Stoltz. KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints. Journal of Machine Learning Research, 2022, 23 (179), pp.1-66. ⟨hal-01785705v3⟩

Exporter

BibTeX XML-TEI Dublin Core DC Terms EndNote DataCite

Collections

ENS-LYON CNRS INRIA UNIV-LYON1 LM-ORSAY INRIA2 UNIV-PARIS-SACLAY UDL ANR GS-MATHEMATIQUES GS-COMPUTER-SCIENCE

440 Consultations

716 Téléchargements