Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

Mohammad Gheshlaghi Azar; Rémi Munos; Hilbert Kappen

doi:10.1007/s10994-013-5368-1

Journal article, Machine Learning, Year: 2013


Mohammad Gheshlaghi Azar

Role: Author

Department of Medical Physics and Biophysics

Rémi Munos

Role: Author
PersonId : 836863

Sequential Learning

Hilbert Kappen

Role: Author

Department of Medical Physics and Biophysics

Abstract

We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with $N$ state-action pairs and discount factor $\gamma \in [0, 1)$, only $O(N \log(N/\delta) / [(1 - \gamma)^3 \epsilon^2])$ state-transition samples are required to find an $\epsilon$-optimal estimate of the action-value function with probability (w.p.) $1 - \delta$. Further, we prove that, for small values of $\epsilon$, a number of samples of order $O(N \log(N/\delta) / [(1 - \gamma)^3 \epsilon^2])$ is required to find an $\epsilon$-optimal policy w.p. $1 - \delta$. We also prove a matching lower bound of $\Omega(N \log(N/\delta) / [(1 - \gamma)^3 \epsilon^2])$ on the sample complexity of estimating the optimal action-value function. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bound matches the lower bound in terms of $N$, $\epsilon$, $\delta$ and $1/(1 - \gamma)$ up to a constant factor. Also, both our lower bound and upper bound improve on the state of the art in terms of their dependence on $1/(1 - \gamma)$.
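To get a feel for how the rate in the abstract scales, the following minimal Python sketch evaluates the expression $N \log(N/\delta) / [(1 - \gamma)^3 \epsilon^2]$ for given parameters. The function name `sample_complexity_order` and the multiplicative constant `c` are hypothetical and introduced only for illustration: the abstract states the bound up to a constant factor and does not give its value, so the returned numbers are meaningful only as ratios, not as actual sample counts.

```python
import math


def sample_complexity_order(n_pairs, epsilon, delta, gamma, c=1.0):
    """Evaluate c * N * log(N / delta) / ((1 - gamma)^3 * epsilon^2).

    `c` is a placeholder for the unspecified constant factor (an assumption
    of this sketch), so only ratios between return values are informative.
    """
    if n_pairs < 1:
        raise ValueError("n_pairs must be at least 1")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    log_term = math.log(n_pairs / delta)
    return c * n_pairs * log_term / ((1.0 - gamma) ** 3 * epsilon ** 2)


# Compare scalings rather than absolute values.
base = sample_complexity_order(n_pairs=1000, epsilon=0.1, delta=0.05, gamma=0.9)
finer = sample_complexity_order(n_pairs=1000, epsilon=0.05, delta=0.05, gamma=0.9)
longer = sample_complexity_order(n_pairs=1000, epsilon=0.1, delta=0.05, gamma=0.99)

print(f"halving epsilon multiplies the rate by {finer / base:.1f}")
print(f"gamma 0.9 -> 0.99 multiplies the rate by {longer / base:.1f}")
```

Under these assumptions, halving $\epsilon$ multiplies the expression by 4, and moving $\gamma$ from 0.9 to 0.99 multiplies it by about 1000, which reflects the cubic dependence on $1/(1 - \gamma)$ highlighted in the abstract.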

Domains

Machine Learning [cs.LG]

Main file

SampCompRL_MLJ2012.pdf (628.1 KB)

Origin: Files produced by the author(s)

Contributor: Rémi Munos

https://hal.science/hal-00831875

Submitted on: Friday, 7 June 2013, 19:25:53

Last modified on: Friday, 24 March 2023, 14:52:57

Long-term archiving on: Tuesday, 4 April 2017, 18:47:44

Dates and versions

hal-00831875, version 1 (07-06-2013)

Identifiers

HAL Id: hal-00831875, version 1
DOI: 10.1007/s10994-013-5368-1

Cite

Mohammad Gheshlaghi Azar, Rémi Munos, Hilbert Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 2013, 91 (3), pp.325-349. ⟨10.1007/s10994-013-5368-1⟩. ⟨hal-00831875⟩


Collections

UNIV-LILLE3 CNRS INRIA LAGIS CRISTAL INRIA2 CRISTAL-SEQUEL

313 views

2900 downloads
