Algorithmes en temps polynomial pour les semi-bandits combinatoires : apprentissage par renforcement efficace dans des environnements complexes

Thibaut Cuvelier

Abstract

Sequential decision making is a core component of many real-world applications, from computer-network operations to online ads. The main tool for such problems is reinforcement learning: an agent takes a sequence of decisions to achieve its goal, typically relying on noisy measurements of how the environment evolves. For instance, such an agent can control a self-driving car; the environment is then the city in which the car manoeuvres. Bandit problems are a class of reinforcement-learning problems for which very strong theoretical properties can be shown. Bandit algorithms focus on the exploration-exploitation dilemma: to perform well, the agent must know its environment in depth (exploration); however, it should also play actions that bring it closer to its goal (exploitation).

In this dissertation, we focus on combinatorial bandits, which are bandits whose decisions are highly structured (a "combinatorial" structure). These include cases where the learning agent determines a path to follow (on a road, in a computer network, etc.) or ads to display on a website. Such situations share a computational difficulty: while it is often easy to determine the optimum decision when the parameters are known (the time to cross a road, the monetary gain of displaying an ad at a given place), the bandit variant (when the parameters must be determined through interactions with the environment) is much more complex.

We propose two new algorithms to tackle these problems with mathematical-optimisation techniques. They rely on weak hypotheses, have a polynomial time complexity, and yet perform well compared to state-of-the-art algorithms for the same problems. They also enjoy excellent statistical properties, meaning that they find a balance between exploration and exploitation that is close to the theoretical optimum. Previous work on combinatorial bandits had to choose between computational burden and statistical performance; our algorithms show that this trade-off is not necessary.


Polynomial-Time Algorithms for Combinatorial Semibandits: Computationally Tractable Reinforcement Learning in Complex Environments
