
Polynomial-time algorithms for combinatorial semi-bandits: efficient reinforcement learning in complex environments

Abstract: Sequential decision making is a core component of many real-world applications, from computer-network operations to online advertising. The major tool for these applications is reinforcement learning: an agent takes a sequence of decisions in order to achieve its goal, with typically noisy measurements of the evolution of the environment. For instance, a self-driving car can be controlled by such an agent; the environment is the city in which the car manoeuvres. Bandit problems are a class of reinforcement-learning problems for which very strong theoretical guarantees can be shown. Bandit algorithms focus on the exploration-exploitation dilemma: to perform well, the agent must acquire a deep knowledge of its environment (exploration), yet it must also play actions that bring it closer to its goal (exploitation).

In this dissertation, we focus on combinatorial bandits, i.e. bandits whose decisions are highly structured (in a "combinatorial" way). These include cases where the learning agent determines a path to follow (on a road, in a computer network, etc.) or ads to display on a website. Such situations share the same computational difficulty: while it is often easy to determine the optimal decision when the parameters are known (the time to cross a road, the monetary gain of displaying an ad in a given slot), the bandit variant (where the parameters must be learned through interactions with the environment) is much harder.

We propose two new algorithms that tackle these problems with mathematical-optimisation techniques. Under weak hypotheses, they run in polynomial time, yet they perform well compared to state-of-the-art algorithms for the same problems. They also enjoy excellent statistical properties: they strike a balance between exploration and exploitation that is close to the theoretical optimum. Previous work on combinatorial bandits had to choose between computational burden and statistical performance; our algorithms show that no such trade-off is necessary.
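The exploration-exploitation trade-off and semi-bandit feedback described in the abstract can be illustrated with a minimal, self-contained sketch. This is not one of the dissertation's algorithms: it is a generic CUCB-style optimistic index policy on a toy two-path routing problem, where exhaustive path enumeration stands in for the polynomial-time combinatorial oracle; the graph, cost values, and variable names are all invented for illustration.

```python
import math
import random

random.seed(0)  # deterministic run for reproducibility

# Tiny two-path routing instance (all values are illustrative assumptions):
# true mean edge costs, unknown to the learner.
EDGES = {"s-a": 0.3, "a-g": 0.3, "s-b": 0.7, "b-g": 0.7}
PATHS = [("s-a", "a-g"), ("s-b", "b-g")]  # first path is optimal (0.6 vs 1.4)

counts = {e: 0 for e in EDGES}   # how often each edge was observed
means = {e: 0.0 for e in EDGES}  # empirical mean cost of each edge

def lcb(edge, t):
    """Optimistic (lower) confidence bound on an edge's mean cost."""
    if counts[edge] == 0:
        return -math.inf  # force at least one observation of every edge
    return means[edge] - math.sqrt(2.0 * math.log(t) / counts[edge])

picks = []
for t in range(1, 2001):
    # "Oracle" step: pick the path with the smallest optimistic total cost.
    # Here we simply enumerate the two paths; on large instances a
    # combinatorial solver (shortest path, matching, ...) would do this
    # in polynomial time.
    path = min(PATHS, key=lambda p: sum(lcb(e, t) for e in p))
    picks.append(path)
    # Semi-bandit feedback: the cost of EVERY edge on the chosen path is
    # revealed, not just the total path cost.
    for e in path:
        cost = min(1.0, max(0.0, random.gauss(EDGES[e], 0.1)))
        counts[e] += 1
        means[e] += (cost - means[e]) / counts[e]

# Fraction of the last 500 rounds spent on the optimal path.
optimal_share = sum(p == PATHS[0] for p in picks[-500:]) / 500
```

On this toy instance the confidence bonuses drive the learner to try both paths early on (exploration), then concentrate on the cheaper one (exploitation), so `optimal_share` approaches 1. The optimistic index is the standard mechanism; the thesis's contribution lies in achieving both polynomial running time and near-optimal statistical guarantees on genuinely combinatorial instances, which this enumeration-based sketch does not attempt.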
Contributor: ABES STAR
Submitted on: Thursday, July 22, 2021 - 3:18:12 PM
Last modification on: Sunday, June 26, 2022 - 3:11:32 AM
Long-term archiving on: Saturday, October 23, 2021 - 6:42:54 PM


Version validated by the jury (STAR)


  • HAL Id: tel-03296009, version 1


Thibaut Cuvelier. Algorithmes en temps polynomial pour les semi-bandits combinatoires : apprentissage par renforcement efficace dans des environnements complexes. Machine Learning [stat.ML]. Université Paris-Saclay, 2021. Français. ⟨NNT : 2021UPASG020⟩. ⟨tel-03296009⟩


