Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

Mohammad Gheshlaghi Azar (1), Rémi Munos (2), Hilbert Kappen (1)
(2) SEQUEL - Sequential Learning
LIFL - Laboratoire d'Informatique Fondamentale de Lille, LAGIS - Laboratoire d'Automatique, Génie Informatique et Signal, Inria Lille - Nord Europe
Abstract : We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that, for an MDP with N state-action pairs and discount factor γ ∈ [0, 1), only O(N log(N/δ) / [(1 - γ)³ ε²]) state-transition samples are required to find an ε-optimal estimate of the action-value function with probability (w.p.) 1 - δ. Further, we prove that, for small values of ε, O(N log(N/δ) / [(1 - γ)³ ε²]) samples are required to find an ε-optimal policy w.p. 1 - δ. We also prove a matching lower bound of Ω(N log(N/δ) / [(1 - γ)³ ε²]) on the sample complexity of estimating the optimal action-value function. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bound matches the lower bound in terms of N, ε, δ and 1/(1 - γ) up to a constant factor. Also, both our lower bound and our upper bound improve on the state of the art in terms of their dependence on 1/(1 - γ).
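To get a feel for how the rate in the abstract scales, the following is a minimal sketch that evaluates N log(N/δ) / [(1 - γ)³ ε²] for given parameters. The function name and the constant `c` are assumptions made for illustration only: the abstract states the bound up to a constant factor and does not give that constant, so the returned value indicates relative scaling, not an actual number of samples.

```python
import math


def sample_complexity_order(n_pairs, epsilon, delta, gamma, c=1.0):
    """Evaluate c * N log(N/delta) / ((1 - gamma)^3 * epsilon^2).

    `c` is a hypothetical placeholder for the unspecified constant factor,
    so only ratios between two calls are meaningful.
    """
    if n_pairs < 1:
        raise ValueError("n_pairs must be at least 1")
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    log_term = math.log(n_pairs / delta)
    return c * n_pairs * log_term / ((1.0 - gamma) ** 3 * epsilon ** 2)


if __name__ == "__main__":
    base = sample_complexity_order(100, 0.1, 0.1, 0.9)
    finer = sample_complexity_order(100, 0.05, 0.1, 0.9)
    longer = sample_complexity_order(100, 0.1, 0.1, 0.95)
    print("halving epsilon multiplies the rate by", finer / base)
    print("doubling 1/(1 - gamma) multiplies the rate by", longer / base)
```

Under this expression, halving ε multiplies the rate by four, and moving from γ = 0.9 to γ = 0.95 (doubling 1/(1 - γ)) multiplies it by about eight, which reflects the cubic dependence on 1/(1 - γ) stated in the abstract.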
Document type :
Journal articles

https://hal.archives-ouvertes.fr/hal-00831875
Contributor : Rémi Munos
Submitted on : Friday, June 7, 2013 - 7:25:53 PM
Last modification on : Thursday, February 21, 2019 - 10:52:49 AM
Long-term archiving on : Tuesday, April 4, 2017 - 6:47:44 PM

File

SampCompRL_MLJ2012.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00831875, version 1

Citation

Mohammad Gheshlaghi Azar, Rémi Munos, Hilbert Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, Springer Verlag, 2013, 91 (3), pp.325-349. ⟨hal-00831875⟩
