Active Roll-outs in MDP with Irreversible Dynamics - HAL open archive
Preprint, Working Paper. Year: 2019

Active Roll-outs in MDP with Irreversible Dynamics

Abstract

In Reinforcement Learning (RL), regret guarantees that scale with the square root of the time horizon have been shown to hold only for communicating Markov decision processes (MDPs), in which any two states are connected. This essentially means that an algorithm can eventually recover from any mistake. However, real-world tasks usually include situations where a single "bad" action can permanently trap a learner in a suboptimal region of the state space. Since it is provably impossible to achieve sub-linear regret in general multi-chain MDPs, we assume a weak mechanism that allows the learner to request additional information. Our main contribution is to address (i) how much external information is needed, (ii) how and when to use it, and (iii) how much regret is incurred. We design an algorithm that minimizes requests for external information, which takes the form of rollouts of a policy specified by the learner, by actively requesting it only when needed. The algorithm provably achieves O(√T) active regret after T steps in a large class of multi-chain MDPs while requesting only O(log(T)) rollout transitions. Experiments on illustrative grid-world examples demonstrate the superiority of our algorithm over standard algorithms such as R-Max and UCRL.

Figure 1: Example of (a) a communicating MDP, (b) a unichain MDP with a single recurrent class, and (c) a multi-chain MDP with two recurrent classes. The circles represent states, while the labeled edges represent transitions due to executing actions {a, b, c}.
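The distinction between a communicating MDP and a multi-chain MDP with irreversible transitions can be illustrated with a small reachability check. The following is a minimal sketch and is not taken from the paper: it assumes the MDP is summarized by a hypothetical dictionary `successors` that maps each state to the set of states reachable in one step under some action with positive probability. The two toy MDPs and all function names are illustrative assumptions.

```python
from collections import deque


def reachable(successors, start):
    """Breadth-first search: states reachable from `start` under some action sequence."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in successors.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def all_states(successors):
    """Collect every state that appears as a source or as a target."""
    states = set(successors)
    for targets in successors.values():
        states.update(targets)
    return states


def is_communicating(successors):
    """True if every state can reach every other state (any two states are connected)."""
    states = all_states(successors)
    return all(reachable(successors, s) == states for s in states)


def no_return_states(successors):
    """States from which at least one other state can no longer be reached."""
    states = all_states(successors)
    return sorted(s for s in states if reachable(successors, s) != states)


# Hypothetical toy examples (not the MDPs of Figure 1).
communicating_mdp = {0: {1}, 1: {0, 2}, 2: {0}}
# From state 0, one action leads to absorbing state 1 and another to absorbing state 2.
multi_chain_mdp = {0: {1, 2}, 1: {1}, 2: {2}}

print(is_communicating(communicating_mdp))
print(is_communicating(multi_chain_mdp))
print(no_return_states(multi_chain_mdp))
```

Under these assumptions the script prints True, False, and [1, 2]: in the second toy MDP a single action taken in state 0 commits the learner to one of two absorbing states, which is the kind of irreversible mistake the abstract describes.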
Main file
maillard16a.pdf (904.27 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02177808, version 1 (09-07-2019)

Identifiers

  • HAL Id: hal-02177808, version 1

Cite

Odalric-Ambrym Maillard, Timothy Mann, Ronald Ortner, Shie Mannor. Active Roll-outs in MDP with Irreversible Dynamics. 2019. ⟨hal-02177808⟩
155 views
181 downloads
