Abstract: This manuscript deals with the estimation of the optimal rule and its mean
reward in a simple bandit setting where, at each round, the player is given a
context, chooses one of two actions based on the context and all past
observations, and receives a reward corresponding to the action undertaken.
The player focuses on the mean reward of the optimal rule and aims to build a narrow
confidence interval for it; as a by-product, she can also estimate her regret.
Inference hinges on the targeted learning methodology. A simulation study
illustrates the results of the manuscript.