Upper Confidence Reinforcement Learning exploiting state-action equivalence

Odalric-Ambrym Maillard 1, Mahsa Asadi 2
1 SEQUEL - Sequential Learning
Inria Lille - Nord Europe, CRIStAL - Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) - UMR 9189
2 MAGNET - Machine Learning in Information Networks
Inria Lille - Nord Europe, CRIStAL - Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) - UMR 9189
Abstract: Leveraging an equivalence property on the set of state-action pairs in a Markov Decision Process (MDP) has been suggested by many authors. We take the study of equivalence classes to the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known, in a discrete MDP with average-reward criterion and no reset. We study powerful similarities between state-action pairs related to optimal transport. We first analyze a variant of the UCRL2 algorithm called C-UCRL2, which highlights the clear benefit of leveraging this equivalence structure when it is known ahead of time: the regret bound scales as Õ(D√(KCT)), where C is the number of classes of equivalent state-action pairs and K bounds the size of the support of the transitions. A non-trivial question is whether this benefit can still be observed when the structure is unknown and must be learned while minimizing the regret. We propose a sound clustering technique that provably learns the unknown classes, but show that its natural combination with UCRL2 empirically fails. Our findings suggest this is due to the ad-hoc criterion for stopping the episodes in UCRL2. We replace it with hypothesis testing, which in turn considerably improves all strategies. It is then empirically validated that learning the structure can be beneficial in a full-blown RL problem.
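The benefit described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the profile-based class test, the 0.2 threshold, and the Hoeffding-style confidence widths below are assumptions made for the example. The idea is that two state-action pairs whose transition laws are permutations of one another share the same sorted "profile", so their samples can be pooled into a single estimate, which shrinks the confidence width.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two state-action pairs whose transition distributions are
# permutations of one another: they belong to the same class.
p1 = np.array([0.7, 0.2, 0.1])  # transitions of pair (s1, a1)
p2 = np.array([0.1, 0.7, 0.2])  # a permutation: same profile

n = 2000
samples1 = rng.choice(3, size=n, p=p1)
samples2 = rng.choice(3, size=n, p=p2)

counts1 = np.bincount(samples1, minlength=3)
counts2 = np.bincount(samples2, minlength=3)

# Profile: the empirical distribution sorted in decreasing order.
prof1 = np.sort(counts1 / n)[::-1]
prof2 = np.sort(counts2 / n)[::-1]

# Cluster the two pairs together when their profiles are close in L1
# distance (the 0.2 threshold is an arbitrary choice for this demo).
same_class = np.abs(prof1 - prof2).sum() < 0.2

# Pooling the samples of c equivalent pairs multiplies the sample
# count by c, shrinking a Hoeffding-style confidence width by sqrt(c);
# here c = 2.
width_single = np.sqrt(np.log(10) / (2 * n))
width_pooled = np.sqrt(np.log(10) / (2 * 2 * n))
```

Aggregated over the whole MDP, estimating C pooled distributions instead of one per state-action pair is what replaces the usual S·A dependence by C in the regret bound quoted above.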

https://hal.archives-ouvertes.fr/hal-01945034
Contributor: Odalric-Ambrym Maillard
Submitted on: Wednesday, December 5, 2018 - 9:57:00 AM
Last modification on: Friday, April 19, 2019 - 4:55:27 PM
Long-term archiving on: Wednesday, March 6, 2019 - 1:08:17 PM

File

UCRL_Classes_HAL.pdf (files produced by the author(s))

Identifiers

  • HAL Id: hal-01945034, version 1

Citation

Odalric-Ambrym Maillard, Mahsa Asadi. Upper Confidence Reinforcement Learning exploiting state-action equivalence. 2018. ⟨hal-01945034⟩

Metrics

Record views: 65
File downloads: 250