
Optimism in Reinforcement Learning and Kullback-Leibler Divergence

Abstract: We consider model-based reinforcement learning in finite Markov Decision Processes (MDPs), focusing on so-called optimistic strategies. In MDPs, optimism can be implemented by carrying out extended value iterations under a constraint of consistency with the estimated model transition probabilities. The UCRL2 algorithm by Auer, Jaksch and Ortner (2009), which follows this strategy, has recently been shown to guarantee near-optimal regret bounds. In this paper, we strongly argue in favor of using the Kullback-Leibler (KL) divergence for this purpose. By studying the linear maximization problem under KL constraints, we provide an efficient algorithm, termed KL-UCRL, for solving KL-optimistic extended value iteration. Using recent deviation bounds on the KL divergence, we prove that KL-UCRL provides the same guarantees as UCRL2 in terms of regret. However, numerical experiments on classical benchmarks show a significantly improved behavior, particularly when the MDP has reduced connectivity. To support this observation, we provide elements of comparison between the two algorithms based on geometric considerations.
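The inner step the abstract refers to — maximizing a linear function of the transition probabilities under a KL constraint — can be sketched numerically. The snippet below is an illustrative sketch, not the paper's own efficient routine: the helper name `kl_optimistic_q`, the bisection on the dual variable, and the assumptions that the empirical distribution has full support and that the value vector is non-constant are ours.

```python
import numpy as np

def kl_optimistic_q(p, V, eps, tol=1e-10, max_iter=200):
    """Maximize q.V over distributions q with KL(p || q) <= eps.

    Illustrative sketch (assumes all p_i > 0 and V non-constant).
    Lagrangian stationarity gives q_i proportional to p_i / (nu - V_i)
    for a dual variable nu > max(V); KL(p || q(nu)) decreases in nu,
    so a simple bisection finds the nu hitting the constraint.
    """
    p = np.asarray(p, dtype=float)
    V = np.asarray(V, dtype=float)

    def q_of(nu):
        w = p / (nu - V)            # q_i proportional to p_i / (nu - V_i)
        return w / w.sum()

    def kl(nu):
        q = q_of(nu)
        return float(np.sum(p * np.log(p / q)))

    # Grow hi until the KL ball contains q(hi), then bisect.
    lo = V.max() + 1e-12
    hi = V.max() + 1.0
    while kl(hi) > eps:
        hi = V.max() + 2.0 * (hi - V.max())
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if kl(mid) > eps:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return q_of(hi)
```

On a uniform empirical distribution, the maximizer shifts probability mass toward high-value states while staying within the KL ball, which is the sense in which the resulting extended value iteration is "optimistic".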

Contributor: Sarah Filippi
Submitted on: Tuesday, October 12, 2010 - 2:13:44 AM
Last modification on: Friday, November 6, 2020 - 11:36:04 PM
Long-term archiving on: Thursday, January 13, 2011 - 2:36:08 AM






Sarah Filippi, Olivier Cappé, Aurélien Garivier. Optimism in Reinforcement Learning and Kullback-Leibler Divergence. 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sep 2010, Monticello, Illinois, United States. pp. 115-122, ⟨10.1109/ALLERTON.2010.5706896⟩. ⟨hal-00476116v3⟩


