Preprint, working paper. Year: 2010

Optimism in Reinforcement Learning Based on Kullback-Leibler Divergence

Abstract

We consider model-based reinforcement learning in finite Markov Decision Processes (MDPs), focusing on so-called optimistic strategies. Optimism is usually implemented by carrying out extended value iterations under a constraint of consistency with the estimated model transition probabilities. In this paper, we strongly argue in favor of using the Kullback-Leibler (KL) divergence for this purpose. By studying the linear maximization problem under KL constraints, we provide an efficient algorithm for solving KL-optimistic extended value iteration. When implemented within the structure of UCRL2, the near-optimal method introduced by [Auer et al., 2008], this algorithm also achieves bounded regrets in the undiscounted case. We however provide some geometric arguments, as well as a concrete illustration on a simulated example, to explain the observed improved practical behavior, particularly when the MDP has reduced connectivity. To analyze this new algorithm, termed KL-UCRL, we also rely on recent deviation bounds for the KL divergence which compare favorably with the L1 deviation bounds used in previous works.
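To make the optimistic inner step concrete, the following is a minimal numerical sketch of the KL-constrained linear maximization that extended value iteration solves at each state-action pair: maximize q·V over probability vectors q subject to KL(p̂ ‖ q) ≤ ε, where p̂ is the estimated transition distribution and V the current value estimates. This sketch uses a generic SciPy solver rather than the efficient dedicated algorithm derived in the paper; the function name kl_optimistic_q, the fixed radius eps (which in KL-UCRL would shrink with the number of visits to the state-action pair), and the toy numbers are illustrative assumptions, not taken from the paper.

import numpy as np
from scipy.optimize import minimize

def kl_optimistic_q(p_hat, V, eps):
    """Maximize q @ V over the simplex subject to KL(p_hat || q) <= eps.

    Generic numerical sketch (SLSQP); the paper derives a faster,
    specialized routine for this inner step of extended value iteration.
    """
    n = len(p_hat)

    def neg_obj(q):
        # Negated linear objective, since SciPy minimizes.
        return -np.dot(q, V)

    def kl_slack(q):
        # eps - KL(p_hat || q); terms with p_hat == 0 contribute nothing.
        mask = p_hat > 0
        return eps - np.sum(p_hat[mask] * np.log(p_hat[mask] / np.maximum(q[mask], 1e-12)))

    constraints = [
        {"type": "eq", "fun": lambda q: np.sum(q) - 1.0},  # q sums to one
        {"type": "ineq", "fun": kl_slack},                 # KL(p_hat || q) <= eps
    ]
    bounds = [(1e-12, 1.0)] * n
    res = minimize(neg_obj, x0=p_hat.copy(), bounds=bounds,
                   constraints=constraints, method="SLSQP")
    return res.x

# Toy example: optimistic transition probabilities for a 3-state MDP.
p_hat = np.array([0.6, 0.3, 0.1])   # estimated transition probabilities
V = np.array([0.0, 1.0, 5.0])       # current value estimates
q_opt = kl_optimistic_q(p_hat, V, eps=0.05)
print(q_opt, q_opt @ V)             # probability mass shifts toward high-value states

This inner maximization is solved many times per sweep of extended value iteration, which is why an efficient dedicated algorithm matters; the generic solver above is only meant to make the optimization problem explicit.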
Main file: KLModelBased.pdf (297.5 KB). Origin: files produced by the author(s)

Dates and versions

hal-00476116 , version 1 (23-04-2010)
hal-00476116 , version 2 (17-06-2010)
hal-00476116 , version 3 (12-10-2010)

Identifiers

hal-00476116

Cite

Sarah Filippi, Olivier Cappé, Aurélien Garivier. Optimism in Reinforcement Learning Based on Kullback-Leibler Divergence. 2010. ⟨hal-00476116v1⟩
