A. Antos, C. Szepesvári, and R. Munos, Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, Machine Learning, vol.71, issue.1, pp.89-129, 2008.
DOI : 10.1007/s10994-007-5038-2

URL : https://hal.archives-ouvertes.fr/hal-00830201

L. Baird, Residual Algorithms: Reinforcement Learning with Function Approximation, Proceedings of the Twelfth International Conference on Machine Learning, pp.30-37, 1995.
DOI : 10.1016/B978-1-55860-377-6.50013-X

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.114.5034

D. P. Bertsekas and S. E. Shreve, Stochastic Optimal Control: The Discrete-Time Case, Academic Press, 1978.

D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.

A. M. Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor, Regularized policy iteration, Proceedings of Advances in Neural Information Processing Systems 21, pp.441-448, 2008.

A. M. Farahmand, R. Munos, and C. Szepesvári, Error propagation for approximate policy and value iteration, Advances in Neural Information Processing Systems, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00830154

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk, A Distribution-Free Theory of Nonparametric Regression, Springer, 2002.
DOI : 10.1007/b97848

M. Lagoudakis and R. Parr, Least-squares policy iteration, Journal of Machine Learning Research, vol.4, pp.1107-1149, 2003.

A. Lazaric, M. Ghavamzadeh, and R. Munos, Finite-sample analysis of LSTD, Proceedings of the 27th International Conference on Machine Learning, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00482189

R. Munos, Error bounds for approximate policy iteration, Proceedings of the 20th International Conference on Machine Learning, pp.560-567, 2003.

R. Munos, Performance bounds in Lp norm for approximate value iteration, SIAM Journal on Control and Optimization, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00124685

R. Munos and C. Szepesvári, Finite-time bounds for fitted value iteration, Journal of Machine Learning Research, vol.9, pp.815-857, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00120882

W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley Series in Probability and Statistics, Wiley, 2007.

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, 1994.

B. Scherrer, Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view, Proceedings of the 27th International Conference on Machine Learning, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00537403

P. J. Schweitzer and A. Seidmann, Generalized polynomial approximations in Markovian decision processes, Journal of Mathematical Analysis and Applications, vol.110, issue.2, pp.568-582, 1985.
DOI : 10.1016/0022-247X(85)90317-8

J. Si, A. G. Barto, W. B. Powell, and D. Wunsch, Handbook of Learning and Approximate Dynamic Programming, IEEE Press, 2004.
DOI : 10.1109/9780470544785

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

C. J. Watkins, Learning from Delayed Rewards, PhD thesis, King's College, Cambridge, 1989.

R. J. Williams and L. C. Baird III, Tight performance bounds on greedy policies based on imperfect value functions, Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, 1994.