Policy Improvement Methods: Between Black-Box Optimization and Episodic Reinforcement Learning

Freek Stulp; Olivier Sigaud

Preprint, Working Paper. Year: 2012



Freek Stulp

Role: Corresponding author
PersonId : 1420
IdHAL : freek-stulp
IdRef : 177920629


Flowing Epigenetic Robots and Systems

Olivier Sigaud

Role: Author
PersonId : 14932
IdHAL : olivier-sigaud
ORCID : 0000-0002-8544-0229
IdRef : 072724714

Institut des Systèmes Intelligents et de Robotique

AMAC

Abstract

Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. There are two main approaches to performing this optimization: reinforcement learning (RL) and black-box optimization (BBO). Whereas BBO algorithms are generic optimization methods that, owing to their generality, may also be applied to optimizing policy parameters, RL algorithms are specifically tailored to leveraging the structure of policy improvement problems. In recent years, benchmark comparisons between RL and BBO have been made, and there have been several attempts to specify which approach works best for which classes of problems. In this article, we make several contributions to this line of research: 1) We define four algorithmic properties that further clarify the relationship between RL and BBO: action-perturbation vs. parameter-perturbation, gradient estimation vs. reward-weighted averaging, use of only rewards vs. use of rewards and state information, actor-critic vs. direct policy search. 2) We show how the chronology of the derivation of ever more powerful algorithms displays a trend towards algorithms based on parameter-perturbation and reward-weighted averaging. A striking feature of this trend is that it has moved RL methods closer and closer to BBO. 3) We continue this trend by applying two modifications to the state-of-the-art "Policy Improvement with Path Integrals" (PI2), which yields an algorithm we denote PIBB. We show that PIBB is a BBO algorithm, and, more specifically, that it is a special case of the "Covariance Matrix Adaptation - Evolutionary Strategy" algorithm. Our empirical evaluation demonstrates that the simpler PIBB outperforms PI2 on five simple evaluation tasks in terms of convergence speed and final cost. 4) Although our evaluation implies that, for these five tasks, BBO outperforms RL, we do not hold this to be a general statement, and provide an analysis of why these tasks are particularly well-suited for BBO.
Thus, rather than making the case for BBO or RL, one of the main contributions of this article is to provide an algorithmic framework in which such cases may be made, as PIBB and PI2 use identical perturbation and parameter update methods and differ only in being BBO and RL approaches, respectively.
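
The two properties that the abstract highlights, parameter-perturbation and reward-weighted averaging, can be illustrated with a minimal sketch. This is not the authors' PIBB implementation: the function names, the fixed exploration magnitude `sigma`, the eliteness parameter `h`, and the quadratic toy cost are all assumptions chosen for illustration. The sketch only shows the black-box setting in which each sampled parameter vector receives a single scalar cost.

```python
import math
import random


def reward_weighted_update(theta, cost_fn, n_samples=10, sigma=0.5, h=10.0, rng=random):
    """One black-box update (illustrative, not the paper's exact algorithm)."""
    # Parameter perturbation: sample whole parameter vectors around theta.
    samples = [[t + rng.gauss(0.0, sigma) for t in theta] for _ in range(n_samples)]
    # Black-box evaluation: each sample yields one scalar cost, nothing else.
    costs = [cost_fn(s) for s in samples]
    # Map costs to weights: lower cost gives a higher weight (h is hypothetical).
    lo, hi = min(costs), max(costs)
    span = (hi - lo) or 1.0  # avoid division by zero when all costs are equal
    weights = [math.exp(-h * (c - lo) / span) for c in costs]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Reward-weighted averaging: new parameters are a weighted mean of the samples.
    return [sum(p * s[i] for p, s in zip(probs, samples)) for i in range(len(theta))]


def quadratic_cost(theta):
    """Toy cost with its minimum at the origin (assumption for illustration)."""
    return sum(t * t for t in theta)


def optimize(theta, cost_fn, n_updates=50, seed=0):
    """Repeat the update with a seeded generator so runs are reproducible."""
    rng = random.Random(seed)
    for _ in range(n_updates):
        theta = reward_weighted_update(theta, cost_fn, rng=rng)
    return theta
```

Calling `optimize([3.0, -2.0], quadratic_cost)` returns a parameter vector whose cost can be compared with that of the starting point. No gradient is estimated at any step; the update uses only the sampled parameters and their scalar costs.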

Domains

Machine Learning [cs.LG]

Main file

rl_blackbox_new_HAL.pdf (956.47 KB)

Origin: Files produced by the author(s)

Contributor: Olivier Sigaud

https://hal.science/hal-00738463

Submitted on: Thursday, October 4, 2012, 15:53:30

Last modified on: Wednesday, March 27, 2024, 15:02:03

Long-term archiving on: Saturday, January 5, 2013, 03:59:03

Dates and versions

hal-00738463, version 1 (04-10-2012)

Identifiers

HAL Id: hal-00738463, version 1

Cite

Freek Stulp, Olivier Sigaud. Policy Improvement Methods: Between Black-Box Optimization and Episodic Reinforcement Learning. 2012. ⟨hal-00738463⟩


Collections

UPMC ENSTA CNRS INRIA ISIR INRIA2 SORBONNE-UNIVERSITE SU-SCIENCES ISIR_AMAC

2076 views

2673 downloads
