Optimism in Reinforcement Learning and Kullback-Leibler Divergence

Sarah Filippi; Olivier Cappé; Aurélien Garivier

doi:10.1109/ALLERTON.2010.5706896

Conference paper. Year: 2010


Sarah Filippi

Role: Author
PersonId: 862433

Laboratoire Traitement et Communication de l'Information

Olivier Cappé

Role: Author
PersonId: 1534
IdHAL: olivier-cappe
ORCID: 0000-0001-7415-8669
IdRef: 057106878

Laboratoire Traitement et Communication de l'Information

Aurélien Garivier

Role: Author
PersonId: 4986
IdHAL: aurelien-garivier
ORCID: 0000-0002-4906-9573
IdRef: 111902495

Laboratoire Traitement et Communication de l'Information

Abstract

We consider model-based reinforcement learning in finite Markov Decision Processes (MDPs), focussing on so-called optimistic strategies. In MDPs, optimism can be implemented by carrying out extended value iterations under a constraint of consistency with the estimated model transition probabilities. The UCRL2 algorithm by Auer, Jaksch and Ortner (2009), which follows this strategy, has recently been shown to guarantee near-optimal regret bounds. In this paper, we strongly argue in favor of using the Kullback-Leibler (KL) divergence for this purpose. By studying the linear maximization problem under KL constraints, we provide an efficient algorithm, termed KL-UCRL, for solving KL-optimistic extended value iteration. Using recent deviation bounds on the KL divergence, we prove that KL-UCRL provides the same guarantees as UCRL2 in terms of regret. However, numerical experiments on classical benchmarks show a significantly improved behavior, particularly when the MDP has reduced connectivity. To support this observation, we provide elements of comparison between the two algorithms based on geometric considerations.
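The linear maximization problem under a KL constraint mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's KL-UCRL algorithm. It only solves one inner step, under assumptions that are not taken from the record: the constraint is written as KL(p_hat || q) <= eps, every entry of the estimated transition vector p_hat is strictly positive, and the value vector is not constant. The names `kl_divergence` and `max_linear_under_kl` are hypothetical, and plain bisection is used for simplicity since the abstract does not specify a numerical method.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def max_linear_under_kl(p, values, eps, iters=200):
    """Maximize sum(q[i] * values[i]) over distributions q with KL(p || q) <= eps.

    Assumptions of this sketch: every p[i] > 0, values is not constant, eps > 0.
    """
    v_max = max(values)
    gaps = [v_max - v for v in values]  # nonnegative, zero at the best state
    span = max(gaps)

    def candidate(t):
        # Stationarity of the Lagrangian gives q[i] proportional to
        # p[i] / (nu - values[i]) with nu = v_max + t and t > 0.
        weights = [pi / (g + t) for pi, g in zip(p, gaps)]
        total = sum(weights)
        return [w / total for w in weights]

    def divergence(t):
        return kl_divergence(p, candidate(t))

    # divergence(t) decreases in t; grow hi until the constraint holds there.
    lo, hi = 0.0, span
    while divergence(hi) > eps:
        hi *= 2.0
    # Bisection keeps divergence(hi) <= eps, so candidate(hi) is always feasible.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if divergence(mid) > eps:
            lo = mid
        else:
            hi = mid
    return candidate(hi)


# Illustrative numbers only (not from the paper).
p_hat = [0.5, 0.3, 0.2]
values = [0.0, 1.0, 2.0]
eps = 0.1
q = max_linear_under_kl(p_hat, values, eps)
optimistic = sum(qi * vi for qi, vi in zip(q, values))
empirical = sum(pi * vi for pi, vi in zip(p_hat, values))
print(q, optimistic, empirical)
```

The returned vector shifts probability mass toward high-value states while staying inside the KL ball around p_hat, which is the sense in which the resulting value estimate is optimistic. In an extended value iteration, a step of this kind would be repeated for each state-action pair, with eps chosen from a confidence bound.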

Keywords

Reinforcement learning; Markov decision processes; Model-based approaches; Optimism; Kullback-Leibler divergence; Regret bounds

Domains

Machine Learning [cs.LG]; Other [stat.ML]; Statistics [math.ST]; Theory [stat.TH]

Main file

KLModelBasedHal.pdf (303.75 KB)

Origin: Files produced by the author(s)


https://hal.science/hal-00476116

Submitted on: Tuesday, October 12, 2010, 02:13:44

Last modified on: Monday, April 22, 2024, 10:17:34

Long-term archiving on: Thursday, January 13, 2011, 02:36:08

Dates and versions

hal-00476116 , version 1 (23-04-2010)

hal-00476116 , version 2 (17-06-2010)

hal-00476116 , version 3 (12-10-2010)

Identifiers

HAL Id: hal-00476116, version 3
arXiv: 1004.5229
DOI: 10.1109/ALLERTON.2010.5706896

Cite

Sarah Filippi, Olivier Cappé, Aurélien Garivier. Optimism in Reinforcement Learning and Kullback-Leibler Divergence. Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, Sep 2010, Monticello (Illinois), United States. pp. 115-122, ⟨10.1109/ALLERTON.2010.5706896⟩. ⟨hal-00476116v3⟩


Collections

INSTITUT-TELECOM CNRS PARISTECH LTCI

220 views

1054 downloads
