Active Roll-outs in MDP with Irreversible Dynamics

Odalric-Ambrym Maillard; Timothy Mann; Ronald Ortner; Shie Mannor

Preprint, Working Paper. Year: 2019

Odalric-Ambrym Maillard

Role: Author
PersonId: 5563
IdHAL: odalric-ambrym-maillard
ORCID: 0000-0001-7935-7026
IdRef: 158055594

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189

Sequential Learning

Timothy Mann

Role: Author
PersonId: 1064021

Google DeepMind

Ronald Ortner

Role: Author

Montanuniversität Leoben

Shie Mannor

Role: Author

McGill University = Université McGill [Montréal, Canada]

Abstract

In Reinforcement Learning (RL), regret guarantees scaling with the square root of the time horizon have been shown to hold only for communicating Markov decision processes (MDPs) where any two states are connected. This essentially means that an algorithm can eventually recover from any mistake. However, real-world tasks usually include situations where taking a single "bad" action can permanently trap a learner in a suboptimal region of the state-space. Since it is provably impossible to achieve sub-linear regret in general multi-chain MDPs, we assume a weak mechanism that allows the learner to request additional information. Our main contribution is to address: (i) how much external information is needed, (ii) how and when to use it, and (iii) how much regret is incurred. We design an algorithm that minimizes requests for external information in the form of rollouts of a policy specified by the learner by actively requesting it only when needed. The algorithm provably achieves O(√T) active regret after T steps in a large class of multi-chain MDPs, by only requesting O(log(T)) rollout transitions. The superiority of our algorithm to standard algorithms such as R-Max and UCRL is demonstrated in experiments on some illustrative grid-world examples.

Figure 1: Example of (a) a communicating MDP, (b) a unichain MDP with a single recurrent class, and (c) a multi-chain MDP with two recurrent classes. The circles represent states while the labeled edges represent transitions due to executing actions {a, b, c}.
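The distinction the abstract draws between communicating and multi-chain MDPs can be illustrated with a small reachability check. The sketch below is not taken from the paper: the function names, the dictionary encoding of transitions, and the two toy MDPs are all assumptions made for illustration. It relies only on the standard characterization that a finite MDP is communicating when every state can reach every other state through transitions that have positive probability under some action.

```python
from collections import deque


def reachable(adj, start):
    """Return the set of states reachable from `start` along edges in `adj`."""
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in adj[s]:
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


def is_communicating(transitions):
    """Check whether every state can reach every other state.

    `transitions` is a hypothetical encoding:
    state -> action -> {next_state: probability}.
    """
    # Edge s -> t exists if some action moves s to t with positive probability.
    adj = {s: set() for s in transitions}
    for s, actions in transitions.items():
        for dist in actions.values():
            for t, p in dist.items():
                if p > 0:
                    adj[s].add(t)
                    adj.setdefault(t, set())
    states = set(adj)
    return all(reachable(adj, s) == states for s in states)


# Toy MDP where any mistake can be undone: states 0 and 1 reach each other.
recoverable = {
    0: {"a": {1: 1.0}},
    1: {"a": {0: 1.0}, "b": {1: 1.0}},
}

# Toy MDP with irreversible dynamics: actions b and c in state 0 lead to
# two distinct absorbing states, so there are two recurrent classes.
trap = {
    0: {"a": {0: 1.0}, "b": {1: 1.0}, "c": {2: 1.0}},
    1: {"a": {1: 1.0}},
    2: {"a": {2: 1.0}},
}

print(is_communicating(recoverable))
print(is_communicating(trap))
```

In the second toy MDP a single choice of action b or c in state 0 can never be undone, which is the kind of situation the abstract says rules out sub-linear regret without additional information.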

Keywords

Reinforcement learning; Regret analysis; Multi-chain MDP; Recoverability

Subject areas

Artificial Intelligence [cs.AI]; Statistics [math.ST]

Main file

maillard16a.pdf (904.27 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-02177808

Submitted on: Tuesday, July 9, 2019, 13:31:43

Last modified on: Friday, April 5, 2024, 09:33:21

Dates and versions

hal-02177808, version 1 (09-07-2019)

Identifiers

HAL Id: hal-02177808, version 1

Cite

Odalric-Ambrym Maillard, Timothy Mann, Ronald Ortner, Shie Mannor. Active Roll-outs in MDP with Irreversible Dynamics. 2019. ⟨hal-02177808⟩

Collections

CNRS INRIA CRISTAL INRIA2 CRISTAL-SEQUEL UNIV-LILLE

