Online Markov Decision Processes Under Bandit Feedback

Gergely Neu 1, András György 2, Csaba Szepesvári 2, András Antos 3
1 SEQUEL - Sequential Learning, LIFL - Laboratoire d'Informatique Fondamentale de Lille, Inria Lille - Nord Europe, LAGIS - Laboratoire d'Automatique, Génie Informatique et Signal
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in hindsight in terms of the total reward received. Specifically, in each time step the agent observes the current state and the reward associated with the last transition; however, it does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is an algorithm with an expected regret of O(T^2/3 ln T). In this paper, assuming that stationary policies mix uniformly fast, we show that after T time steps the expected regret of this algorithm (more precisely, of a slightly modified version thereof) is O(T^1/2 ln T), giving the first rigorously proven, essentially tight regret bound for the problem.
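To make the interaction protocol in the abstract concrete, the following minimal Python sketch simulates the setting: the transition kernel is known and fixed, an oblivious adversary commits to the whole reward sequence before the interaction starts, and under bandit feedback the agent only observes the reward of the transition it actually experienced. All names here (P, rewards, the uniform placeholder policy) are illustrative assumptions; this is not the algorithm analysed in the paper, which would update its policy from the observed feedback.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, T = 3, 2, 1000

# Known transition probabilities P[s, a, s'] (each row sums to one).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# Oblivious adversary: the reward functions r_t(s, a) in [0, 1] for all
# T time steps are fixed in advance, before the interaction begins.
rewards = rng.uniform(size=(T, n_states, n_actions))

# Placeholder learner: a fixed uniformly random stationary policy.
policy = np.full((n_states, n_actions), 1.0 / n_actions)

state, total_reward = 0, 0.0
for t in range(T):
    action = rng.choice(n_actions, p=policy[state])
    # Bandit feedback: only the reward of the visited (state, action)
    # pair at time t is revealed to the agent.
    observed_reward = rewards[t, state, action]
    total_reward += observed_reward
    # The environment moves to the next state according to the known kernel.
    state = rng.choice(n_states, p=P[state, action])

print(f"total reward collected over T={T} steps: {total_reward:.1f}")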

https://hal.archives-ouvertes.fr/hal-01079422
Contributor: Gergely Neu
Submitted on: Saturday, November 1, 2014 - 7:14:21 PM
Last modification on: Thursday, February 21, 2019 - 10:52:49 AM
Long-term archiving on: Monday, February 2, 2015 - 5:00:24 PM

File: NGSA14.pdf (files produced by the author(s))

Citation

Gergely Neu, András György, Csaba Szepesvári, András Antos. Online Markov Decision Processes Under Bandit Feedback. IEEE Transactions on Automatic Control, Institute of Electrical and Electronics Engineers, 2014, 59, pp.676 - 691. ⟨10.1109/TAC.2013.2292137⟩. ⟨hal-01079422⟩
