Zap Q-Learning for Optimal Stopping
Abstract
This paper concerns approximate solutions to the optimal stopping problem for a geometrically ergodic Markov chain on a continuous state space. The starting point is the Galerkin relaxation of the dynamic programming equations introduced by Tsitsiklis and Van Roy in the 1990s, which motivated their Q-learning algorithm for optimal stopping. It is known that the convergence rate of Q-learning is in many cases very slow. The reason for this slow convergence is explained here, along with a new variant of the Zap Q-learning algorithm designed to achieve the optimal rate of convergence. The main contribution is to establish consistency of the Zap Q-learning algorithm in a linear function approximation setting. The theoretical results are illustrated with an example from finance.