M. Geist and B. Scherrer, Off-policy Learning with Eligibility Traces: A Survey, Journal of Machine Learning Research, vol.15, pp.289-333, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00921275

M. Geist and O. Pietquin, Algorithmic Survey of Parametric Value Function Approximation, IEEE Transactions on Neural Networks and Learning Systems, vol.24, issue.6, pp.845-867, 2013.
DOI : 10.1109/TNNLS.2013.2247418

URL : https://hal.archives-ouvertes.fr/hal-00869725

H. Frezza-Buet and M. Geist, A C++ Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics, Journal of Machine Learning Research, vol.14, pp.399-402, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00914768

L. Daubigney, M. Geist, S. Chandramohan, and O. Pietquin, A Comprehensive Reinforcement Learning Framework for Dialogue Management Optimization, IEEE Journal of Selected Topics in Signal Processing, vol.6, issue.8, pp.891-902, 2012.
DOI : 10.1109/JSTSP.2012.2229257

O. Pietquin, M. Geist, S. Chandramohan, and H. Frezza-Buet, Sample-Efficient Batch Reinforcement Learning for Dialogue Management Optimization, ACM Transactions on Speech and Language Processing, vol.7, issue.3, 2011.

M. Geist and O. Pietquin, Kalman Temporal Differences, Journal of Artificial Intelligence Research (JAIR), vol.39, pp.483-532, 2010.

URL : https://hal.archives-ouvertes.fr/hal-00858687

M. Geist, O. Pietquin, and G. Fricout, From Supervised to Reinforcement Learning: a Kernel-based Bayesian Filtering Framework, International Journal On Advances in Software, vol.2, issue.1, pp.101-116, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00429891

B. Piot, M. Geist, and O. Pietquin, Boosted and Reward-regularized Classification for Apprenticeship Learning, 13th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2014), 2014.
URL : https://hal.archives-ouvertes.fr/hal-01107837

L. Daubigney, M. Geist, and O. Pietquin, Model-free POMDP optimisation of tutoring systems with echo-state networks, Proceedings of the 14th SIGDial Meeting on Discourse and Dialogue, pp.102-106, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00869773

M. Geist, E. Klein, B. Piot, Y. Guermeur, and O. Pietquin, Around Inverse Reinforcement Learning and Score-based Classification, 1st Multidisciplinary Conference on Reinforcement Learning and Decision Making, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00916936

B. Piot, M. Geist, and O. Pietquin, Learning from Demonstrations: Is It Worth Estimating a Reward Function?, Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013), pp.17-32, 2013.
DOI : 10.1007/978-3-642-40988-2_2

URL : https://hal.archives-ouvertes.fr/hal-00916938

E. Klein, B. Piot, M. Geist, and O. Pietquin, A Cascaded Supervised Learning Approach to Inverse Reinforcement Learning, Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013), pp.1-16, 2013.
DOI : 10.1007/978-3-642-40988-2_1

URL : https://hal.archives-ouvertes.fr/hal-00869804

L. Daubigney, M. Geist, and O. Pietquin, Particle Swarm Optimisation of Spoken Dialogue System Strategies, Proceedings of the 14th Annual Conference of the International Speech Communication Association, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00916935

R. Niewiadomski, J. Hofmann, J. Urbain, T. Platt, J. Wagner et al., Laugh-aware virtual agent and its impact on user amusement, International Conference on Autonomous Agents and Multiagent Systems, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00869751

L. Daubigney, M. Geist, and O. Pietquin, Random Projections: a Remedy for Overfitting Issues in Time Series Prediction with Echo State Networks, IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.

S. Chandramohan, M. Geist, F. Lefèvre, and O. Pietquin, Co-adaptation in Spoken Dialogue Systems, International Workshop on Spoken Dialog Systems, 2012.
DOI : 10.1007/978-1-4614-8280-2_31

URL : https://hal.archives-ouvertes.fr/hal-00778752

E. Klein, M. Geist, B. Piot, and O. Pietquin, Inverse Reinforcement Learning through Structured Classification, Advances in Neural Information Processing Systems, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00778624

S. Chandramohan, M. Geist, F. Lefèvre, and O. Pietquin, Behavior Specific User Simulation in Spoken Dialogue Systems, ITG Conference on Speech Communication, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00749421

B. Scherrer, V. Gabillon, M. Ghavamzadeh, and M. Geist, Approximate Modified Policy Iteration, International Conference on Machine Learning (ICML), 2012.
URL : https://hal.archives-ouvertes.fr/hal-00697169

M. Geist, B. Scherrer, A. Lazaric, and M. Ghavamzadeh, A Dantzig Selector Approach to Temporal Difference Learning, International Conference on Machine Learning (ICML), 2012.
URL : https://hal.archives-ouvertes.fr/hal-00749480

J. Oster, M. Geist, O. Pietquin, and G. Clifford, Filtering of pathological ventricular rhythms during MRI scanning, International Workshop on Biosignal Interpretation, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00749457

S. Chandramohan, M. Geist, F. Lefèvre, and O. Pietquin, Clustering behaviors of Spoken Dialogue Systems users, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.
DOI : 10.1109/ICASSP.2012.6289038

URL : https://hal.archives-ouvertes.fr/hal-00685009

L. Daubigney, M. Geist, and O. Pietquin, Off-policy Learning in Large-scale POMDP-based Dialogue Systems, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.4989-4992, 2012.
DOI : 10.1109/icassp.2012.6289040

URL : https://hal.archives-ouvertes.fr/hal-00684819

J. Fix and M. Geist, Monte-Carlo Swarm Policy Search, Symposium on Swarm Intelligence and Differential Evolution, 2012.
DOI : 10.1007/978-3-642-29353-5_9

URL : https://hal.archives-ouvertes.fr/hal-00695540

M. Geist and O. Pietquin, Kalman filtering & colored noises: the (autoregressive) moving-average case, IEEE Workshop on Machine Learning Algorithms, Systems and Applications, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00660607

E. Klein, M. Geist, and O. Pietquin, Reducing the dimensionality of the reward space in the Inverse Reinforcement Learning problem, IEEE Workshop on Machine Learning Algorithms, Systems and Applications, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00660612

H. Glaude, F. Akrimi, M. Geist, and O. Pietquin, A Non-parametric Approach to Approximate Dynamic Programming, 2011 10th International Conference on Machine Learning and Applications and Workshops, pp.317-322, 2011.
DOI : 10.1109/ICMLA.2011.19

URL : https://hal.archives-ouvertes.fr/hal-00652438

O. Pietquin, L. Daubigney, and M. Geist, Optimization of a Tutoring System from a Fixed Set of Data, ISCA workshop on Speech and Language Technology in Education, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00652324

L. Daubigney, M. Gasic, S. Chandramohan, M. Geist, O. Pietquin et al., Uncertainty management for on-line optimisation of a POMDP-based large-scale spoken dialogue system, Annual Conference of the International Speech Communication Association, pp.1301-1304, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00652194

S. Chandramohan, M. Geist, F. Lefèvre, and O. Pietquin, User Simulation in Dialogue Systems using Inverse Reinforcement Learning, Annual Conference of the International Speech Communication Association, pp.1025-1028, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00652446

R. Chou, Y. Boers, M. Podt, and M. Geist, Performance Evaluation for Particle Filters, International Conference on Information Fusion, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00652168

O. Pietquin, M. Geist, and S. Chandramohan, Sample Efficient On-line Learning of Optimal Dialogue Policies with Kalman Temporal Differences, International Joint Conference on Artificial Intelligence (IJCAI 2011), pp.1878-1883, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00618252

J. Fix, M. Geist, O. Pietquin, and H. Frezza-Buet, Dynamic neural field optimization using the unscented Kalman filter, 2011 IEEE Symposium on Computational Intelligence, Cognitive Algorithms, Mind, and Brain (CCMB), 2011.
DOI : 10.1109/CCMB.2011.5952113

URL : https://hal.archives-ouvertes.fr/hal-00618117

M. Geist and O. Pietquin, Parametric value function approximation: A unified view, 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp.9-16, 2011.
DOI : 10.1109/ADPRL.2011.5967355

URL : https://hal.archives-ouvertes.fr/hal-00618112

M. Geist and O. Pietquin, Managing Uncertainty within the KTD Framework, Workshop on Active Learning and Experimental Design, Journal of Machine Learning Research (Conference and Workshop Proceedings), pp.157-168, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00599636

M. Geist and B. Scherrer, ℓ1-Penalized Projected Bellman Residual, European Workshop on Reinforcement Learning (EWRL 2011), Lecture Notes in Computer Science (LNCS), 2011.
DOI : 10.1007/978-3-642-29946-9_12

URL : http://hal.inria.fr/docs/00/64/45/07/PDF/gs_ewrl_l1_cr.pdf

E. Klein, M. Geist, and O. Pietquin, Batch, Off-Policy and Model-Free Apprenticeship Learning, European Workshop on Reinforcement Learning, 2011.
DOI : 10.1007/978-3-642-29946-9_28

URL : https://hal.archives-ouvertes.fr/hal-00660623

B. Scherrer and M. Geist, Recursive Least-Squares Learning with Eligibility Traces, European Workshop on Reinforcement Learning (EWRL 2011), Lecture Notes in Computer Science (LNCS), 2011.
DOI : 10.1007/978-3-642-29946-9_14

URL : https://hal.archives-ouvertes.fr/hal-00644511

M. Geist and O. Pietquin, Eligibility traces through colored noises, International Congress on Ultra Modern Telecommunications and Control Systems, pp.458-465, 2010.
DOI : 10.1109/ICUMT.2010.5676597

URL : https://hal.archives-ouvertes.fr/hal-00553910

M. Geist and O. Pietquin, Statistically linearized least-squares temporal differences, International Congress on Ultra Modern Telecommunications and Control Systems, pp.450-457, 2010.
DOI : 10.1109/ICUMT.2010.5676598

URL : https://hal.archives-ouvertes.fr/hal-00554338

S. Chandramohan, M. Geist, and O. Pietquin, Optimizing Spoken Dialogue Management with Fitted Value Iteration, International Conference on Speech Communication and Technologies, pp.86-89, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00553184

S. Chandramohan, M. Geist, and O. Pietquin, Sparse Approximate Dynamic Programming for Dialog Management, SIGDial Conference on Discourse and Dialogue, pp.107-115, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00553180

M. Geist and O. Pietquin, Statistically linearized recursive least squares, 2010 IEEE International Workshop on Machine Learning for Signal Processing, pp.272-276, 2010.
DOI : 10.1109/MLSP.2010.5589236

URL : https://hal.archives-ouvertes.fr/hal-00553168

M. Geist and O. Pietquin, Revisiting Natural Actor-Critics with Value Function Approximation, Modeling Decisions for Artificial Intelligence, pp.207-218, 2010.

URL : https://hal.archives-ouvertes.fr/hal-00554346

M. Geist, O. Pietquin, and G. Fricout, Tracking in Reinforcement Learning, International Conference on Neural Information Processing, pp.502-511, 2009. ENNS best student paper award.

M. Geist, O. Pietquin, and G. Fricout, Kernelizing Vector Quantization Algorithms, European Symposium on Artificial Neural Networks (ESANN 09), 2009.

M. Geist, O. Pietquin, and G. Fricout, Kalman Temporal Differences: The deterministic case, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp.185-192, 2009.
DOI : 10.1109/ADPRL.2009.4927543

URL : https://hal.archives-ouvertes.fr/hal-00380870

M. Geist, O. Pietquin, and G. Fricout, Bayesian Reward Filtering, Recent Advances in Reinforcement Learning, pp.96-109, 2008.

URL : https://hal.archives-ouvertes.fr/hal-00351282

M. Geist, O. Pietquin, and G. Fricout, Online Bayesian kernel regression from nonlinear mapping of observations, 2008 IEEE Workshop on Machine Learning for Signal Processing, pp.309-314, 2008.
DOI : 10.1109/MLSP.2008.4685498

URL : https://hal.archives-ouvertes.fr/hal-00335052

M. Geist, O. Pietquin, and G. Fricout, A Sparse Nonlinear Bayesian Online Kernel Regression, 2008 The Second International Conference on Advanced Engineering Computing and Applications in Sciences, pp.199-204, 2008.
DOI : 10.1109/ADVCOMP.2008.7

URL : https://hal.archives-ouvertes.fr/hal-00327081

B. Piot, M. Geist, and O. Pietquin, Classification régularisée par la récompense pour l'Apprentissage par Imitation, Journées Francophones de Planification, Décision et Apprentissage (JFPDA), 2013.

B. Piot, M. Geist, and O. Pietquin, Apprentissage par démonstrations : vaut-il la peine d'estimer une fonction de récompense ?, Journées Francophones de Planification, Décision et Apprentissage (JFPDA), 2013.

E. Klein, B. Piot, M. Geist, and O. Pietquin, Classification structurée pour l'apprentissage par renforcement inverse, Conférence Francophone sur l'Apprentissage Automatique, 2012.
DOI : 10.3166/ria.27.155-169

J. Fix and M. Geist, Optimisation de contrôleurs par essaim de particules, Conférence Francophone sur l'Apprentissage Automatique, 2012.

L. Daubigney, M. Geist, and O. Pietquin, Apprentissage off-policy appliqué à un système de dialogue basé sur les PDMPO, Congrès francophone sur la Reconnaissance de Formes et l'Intelligence Artificielle, 2012.

M. Geist, B. Scherrer, A. Lazaric, and M. Ghavamzadeh, Un sélecteur de Dantzig pour l'apprentissage par différences temporelles, Journées Francophones sur la Planification, la Décision et l'Apprentissage pour la conduite des systèmes (JFPDA), 2012.

B. Scherrer, V. Gabillon, M. Ghavamzadeh, and M. Geist, Approximations de l'algorithme Itérations sur les Politiques Modifié, Journées Francophones sur la Planification, la Décision et l'Apprentissage pour la conduite des systèmes (JFPDA), 2012.

S. Chandramohan, M. Geist, F. Lefèvre, and O. Pietquin, Regroupement non-supervisé d'utilisateurs par leur comportement pour les systèmes de dialogue parlé, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2012.

L. Daubigney, M. Geist, and O. Pietquin, Apprentissage par renforcement pour la personnalisation d'un logiciel d'enseignement des langues, Conférence sur les Environnements Informatiques pour l'Apprentissage Humain, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00652516

M. Geist and B. Scherrer, Moindres carrés récursifs pour l'évaluation off-policy d'une politique avec traces d'éligibilité, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2011.

E. Klein, M. Geist, and O. Pietquin, Apprentissage par imitation étendu au cas batch, off-policy et sans modèle, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2011.

L. Daubigney, M. Geist, and O. Pietquin, Gestion de l'incertitude pour l'optimisation en ligne d'un gestionnaire de dialogues parlés à grande échelle basé sur les POMDP, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2011.

S. Chandramohan, M. Geist, and O. Pietquin, Apprentissage par Renforcement Inverse pour la Simulation d'Utilisateurs dans les Systèmes de Dialogue, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2011.

P. Abbeel and A. Y. Ng, Apprenticeship learning via inverse reinforcement learning, Twenty-first International Conference on Machine Learning (ICML '04), 2004.
DOI : 10.1145/1015330.1015430

URL : http://www.aicml.cs.ualberta.ca/banff04/icml/pages/papers/335.pdf

D. Achlioptas, Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Journal of Computer and System Sciences, vol.66, issue.4, pp.671-687, 2003.
DOI : 10.1016/S0022-0000(03)00025-4

URL : https://doi.org/10.1016/s0022-0000(03)00025-4

L. A. Amaral and D. Meurers, On using intelligent computer-assisted language learning in real-life foreign language teaching and learning, ReCALL, pp.4-24, 2011.

A. Antos, C. Szepesvári, and R. Munos, Fitted Q-iteration in continuous action-space MDPs, Advances in Neural Information Processing Systems (NIPS), pp.9-16, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00185311

A. Antos, C. Szepesvári, and R. Munos, Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, Machine Learning, pp.89-129, 2008.
DOI : 10.1007/11776420_42

URL : https://hal.archives-ouvertes.fr/hal-00830201

T. Archibald, K. McKinnon, and L. Thomas, On the Generation of Markov Decision Processes, Journal of the Operational Research Society, vol.46, issue.3, pp.354-361, 1995.
DOI : 10.1057/jors.1995.50

K. J. Åström, Optimal control of Markov processes with incomplete state information, Journal of Mathematical Analysis and Applications, vol.10, issue.1, p.174, 1965.

B. Ávila Pires, C. Szepesvári, and M. Ghavamzadeh, Cost-sensitive multiclass classification risk bounds, International Conference on Machine Learning (ICML), pp.1391-1399, 2013.

L. C. Baird, Residual Algorithms: Reinforcement Learning with Function Approximation, International Conference on Machine Learning (ICML), pp.30-37, 1995.

A. M. S. Barreto, J. Pineau, and D. Precup, Policy iteration based on stochastic factorization, Journal of Artificial Intelligence Research, pp.763-803, 2014.

J. Baxter and P. L. Bartlett, Infinite-Horizon Gradient-Based Policy Search, Journal of Artificial Intelligence Research, vol.15, pp.319-350, 2001.

O. Beijbom, M. Saberian, D. Kriegman, and N. Vasconcelos, Guess-Averse Loss Functions For Cost-Sensitive Multiclass Boosting, International Conference on Machine Learning (ICML), pp.586-594, 2014.

D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, 1995.

D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.

D. P. Bertsekas and H. Yu, Projected equation methods for approximate solution of large linear systems, Journal of Computational and Applied Mathematics, vol.227, pp.27-50, 2009.

S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, Natural actor-critic algorithms, Automatica, vol.45, issue.11, 2009.
DOI : 10.1016/j.automatica.2009.07.008

URL : https://hal.archives-ouvertes.fr/hal-00840470

A. Boularias, J. Kober, and J. Peters, Relative entropy inverse reinforcement learning, International Conference on Artificial Intelligence and Statistics (AISTATS), pp.182-189, 2011.

S. J. Bradtke and A. G. Barto, Linear Least-Squares algorithms for temporal difference learning, Machine Learning, pp.33-57, 1996.

E. Candes and T. Tao, The Dantzig selector: Statistical estimation when p is much larger than n, The Annals of Statistics, vol.35, issue.6, pp.2313-2351, 2007.
DOI : 10.1214/009053606000001523

URL : http://doi.org/10.1214/009053606000001523

S. Chandramohan, Revisiting User Simulation in Dialogue Systems: Do we still need them? Will imitation play the role of simulation?, Thèse de Doctorat en Informatique, 2012.
URL : https://hal.archives-ouvertes.fr/tel-00875229

S. Chandramohan, M. Geist, F. Lefèvre, and O. Pietquin, User Simulation in Dialogue Systems using Inverse Reinforcement Learning, Annual Conference of the International Speech Communication Association, pp.1025-1028, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00652446

S. Chandramohan, M. Geist, F. Lefèvre, and O. Pietquin, Behavior Specific User Simulation in Spoken Dialogue Systems, ITG Conference on Speech Communication, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00749421

S. Chandramohan, M. Geist, F. Lefèvre, and O. Pietquin, Clustering behaviors of Spoken Dialogue Systems users, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.
DOI : 10.1109/ICASSP.2012.6289038

URL : https://hal.archives-ouvertes.fr/hal-00685009

S. Chandramohan, M. Geist, F. Lefèvre, and O. Pietquin, Co-adaptation in Spoken Dialogue Systems, International Workshop on Spoken Dialog Systems, 2012.
DOI : 10.1007/978-1-4614-8280-2_31

URL : https://hal.archives-ouvertes.fr/hal-00778752

S. Chandramohan, M. Geist, F. Lefèvre, and O. Pietquin, Regroupement non-supervisé d'utilisateurs par leur comportement pour les systèmes de dialogue parlé, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2012.

S. Chandramohan, M. Geist, and O. Pietquin, Optimizing Spoken Dialogue Management with Fitted Value Iteration, International Conference on Speech Communication and Technologies, pp.86-89, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00553184

S. Chandramohan, M. Geist, and O. Pietquin, Sparse Approximate Dynamic Programming for Dialog Management, SIGDial Conference on Discourse and Dialogue, pp.107-115, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00553180

S. Chandramohan, M. Geist, and O. Pietquin, Apprentissage par Renforcement Inverse pour la Simulation d'Utilisateurs dans les Systèmes de Dialogue, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2011.

K. Chang, J. Beck, J. Mostow, and A. Corbett, A Bayes Net Toolkit for Student Modeling in Intelligent Tutoring Systems, Intelligent Tutoring Systems, pp.104-113, 2006.
DOI : 10.1007/11774303_11

J. Chemali and A. Lazaric, Direct Policy Iteration with Demonstrations, International Joint Conference on Artificial Intelligence (IJCAI), 2015.
URL : https://hal.archives-ouvertes.fr/hal-01237659

S. Chernova and M. Veloso, Interactive policy learning through confidence-based autonomy, Journal of Artificial Intelligence Research, vol.34, issue.11, 2009.

D. Choi and B. Van Roy, A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning, Discrete Event Dynamic Systems, pp.207-239, 2006.

R. Chou, Y. Boers, M. Podt, and M. Geist, Performance Evaluation for Particle Filters, International Conference on Information Fusion, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00652168

A. T. Corbett and J. R. Anderson, Knowledge tracing: Modeling the acquisition of procedural knowledge, User Modeling and User-Adapted Interaction, pp.253-278, 1994.

A. T. Corbett, J. R. Anderson, and A. T. O'Brien, Student modeling in the ACT Programming Tutor, Cognitively Diagnostic Assessment, pp.19-41, 1995.

L. Daubigney, Gestion de l'incertitude pour l'optimisation de systèmes interactifs, Thèse de Doctorat en Informatique, 2013.

L. Daubigney, M. Gasic, S. Chandramohan, M. Geist, O. Pietquin et al., Uncertainty management for on-line optimisation of a POMDP-based large-scale spoken dialogue system, Annual Conference of the International Speech Communication Association, pp.1301-1304, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00652194

L. Daubigney, M. Geist, S. Chandramohan, and O. Pietquin, A Comprehensive Reinforcement Learning Framework for Dialogue Management Optimization, IEEE Journal of Selected Topics in Signal Processing, vol.6, issue.8, pp.891-902, 2012.
DOI : 10.1109/JSTSP.2012.2229257

L. Daubigney, M. Geist, and O. Pietquin, Apprentissage par renforcement pour la personnalisation d'un logiciel d'enseignement des langues, Conférence sur les Environnements Informatiques pour l'Apprentissage Humain, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00652516

L. Daubigney, M. Geist, and O. Pietquin, Gestion de l'incertitude pour l'optimisation en ligne d'un gestionnaire de dialogues parlés à grande échelle basé sur les POMDP, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2011.

L. Daubigney, M. Geist, and O. Pietquin, Apprentissage off-policy appliqué à un système de dialogue basé sur les PDMPO, Congrès francophone sur la Reconnaissance de Formes et l'Intelligence Artificielle, 2012.

L. Daubigney, M. Geist, and O. Pietquin, Off-policy Learning in Large-scale POMDP-based Dialogue Systems, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.4989-4992, 2012.
DOI : 10.1109/icassp.2012.6289040

URL : https://hal.archives-ouvertes.fr/hal-00684819

L. Daubigney, M. Geist, and O. Pietquin, Model-free POMDP optimisation of tutoring systems with echo-state networks, Proceedings of the 14th SIGDial Meeting on Discourse and Dialogue, pp.102-106, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00869773

L. Daubigney, M. Geist, and O. Pietquin, Optimisation par essaims particulaires de stratégies de dialogue, Journées Francophones de Planification, Décision et Apprentissage (JFPDA), 2013.

L. Daubigney, M. Geist, and O. Pietquin, Particle Swarm Optimisation of Spoken Dialogue System Strategies, Proceedings of the 14th Annual Conference of the International Speech Communication Association, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00916935

L. Daubigney, M. Geist, and O. Pietquin, Random Projections: a Remedy for Overfitting Issues in Time Series Prediction with Echo State Networks, IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.

D. Pucci de Farias and B. Van Roy, The linear programming approach to approximate dynamic programming, Operations Research, vol.51, issue.6, pp.850-865, 2003.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least Angle Regression, Annals of Statistics, vol.32, issue.2, pp.407-499, 2004.

Y. Engel, Algorithms and Representations for Reinforcement Learning, 2005.

Y. Engel, S. Mannor, and R. Meir, Bayes Meets Bellman: The Gaussian Process Approach to Temporal Difference Learning, International Conference on Machine Learning (ICML), pp.154-161, 2003.

Y. Engel, S. Mannor, and R. Meir, The Kernel Recursive Least-Squares Algorithm, IEEE Transactions on Signal Processing, vol.52, issue.8, pp.2275-2285, 2004.
DOI : 10.1109/TSP.2004.830985

URL : http://www-ee.technion.ac.il/~rmeir/Publications/Engel-Mannor-Meir-IEEE04.pdf

D. Ernst, P. Geurts, and L. Wehenkel, Tree-Based Batch Mode Reinforcement Learning, Journal of Machine Learning Research, vol.6, pp.503-556, 2005.

A. M. Farahmand, M. Ghavamzadeh, S. Mannor, and C. Szepesvári, Regularized policy iteration, Advances in Neural Information Processing Systems (NIPS), pp.441-448, 2009.

A. M. Farahmand and C. Szepesvári, Model selection in reinforcement learning, Machine Learning, vol.18, issue.1, pp.299-332, 2011.

A. M. Farahmand, C. Szepesvári, and R. Munos, Error propagation for approximate policy and value iteration, Advances in Neural Information Processing Systems (NIPS), pp.568-576, 2010.

A. Fern, S. W. Yoon, and R. Givan, Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processes, Journal of Artificial Intelligence Research (JAIR), vol.25, pp.75-118, 2006.

J. Fix and M. Geist, Monte-Carlo Swarm Policy Search, Symposium on Swarm Intelligence and Differential Evolution, 2012.
DOI : 10.1007/978-3-642-29353-5_9

URL : https://hal.archives-ouvertes.fr/hal-00695540

J. Fix and M. Geist, Optimisation de contrôleurs par essaim de particules, Conférence Francophone sur l'Apprentissage Automatique, 2012.

J. Fix, M. Geist, O. Pietquin, and H. Frezza-Buet, Dynamic neural field optimization using the unscented Kalman filter, 2011 IEEE Symposium on Computational Intelligence, Cognitive Algorithms, Mind, and Brain (CCMB), 2011.
DOI : 10.1109/CCMB.2011.5952113

URL : https://hal.archives-ouvertes.fr/hal-00618117

H. Frezza-Buet and M. Geist, A C++ Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics, Journal of Machine Learning Research, vol.14, pp.399-402, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00914768

M. Geist, Analyse des données pour l'analyse, le suivi et le contrôle des dispersions, 2006.

M. Geist, Modélisation de chaînes de production et de leurs interactions, Supélec (M2R Mathématiques), 2006.

M. Geist, Optimisation des chaînes de production dans l'industrie sidérurgique : une approche statistique de l'apprentissage par renforcement, Doctorat en Mathématiques, 2009.

M. Geist, A multiplicative UCB strategy for Gamma rewards, European Workshop on Reinforcement Learning (EWRL), 2015.
URL : https://hal.archives-ouvertes.fr/hal-01258820

M. Geist, Soft-max boosting, Machine Learning, 2015.

URL : https://hal.archives-ouvertes.fr/hal-01258816

M. Geist, E. Klein, B. Piot, Y. Guermeur, and O. Pietquin, Around Inverse Reinforcement Learning and Score-based Classification, 1st Multidisciplinary Conference on Reinforcement Learning and Decision Making, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00916936

M. Geist and O. Pietquin, Architectures acteur-critique avec approximation de la valeur, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2010.

M. Geist and O. Pietquin, Eligibility traces through colored noises, International Congress on Ultra Modern Telecommunications and Control Systems, pp.458-465, 2010.
DOI : 10.1109/ICUMT.2010.5676597

URL : https://hal.archives-ouvertes.fr/hal-00553910

M. Geist and O. Pietquin, Gestion de l'incertitude dans le cadre de l'approximation de la fonction de valeur pour l'apprentissage par renforcement, Conférence francophone sur l'apprentissage automatique, pp.101-112, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00553895

M. Geist and O. Pietquin, Kalman Temporal Differences, Journal of Artificial Intelligence Research (JAIR), vol.39, pp.483-532, 2010.

URL : https://hal.archives-ouvertes.fr/hal-00858687

M. Geist and O. Pietquin, Linéarisation statistique pour les différences temporelles par moindres carrés, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2010.

M. Geist and O. Pietquin, Revisiting Natural Actor-Critics with Value Function Approximation, Modeling Decisions for Artificial Intelligence, pp.207-218, 2010.

URL : https://hal.archives-ouvertes.fr/hal-00554346

M. Geist and O. Pietquin, Statistically linearized least-squares temporal differences, International Congress on Ultra Modern Telecommunications and Control Systems, pp.450-457, 2010.
DOI : 10.1109/ICUMT.2010.5676598

URL : https://hal.archives-ouvertes.fr/hal-00554338

M. Geist and O. Pietquin, Statistically linearized recursive least squares, 2010 IEEE International Workshop on Machine Learning for Signal Processing, pp.272-276, 2010.
DOI : 10.1109/MLSP.2010.5589236

URL : https://hal.archives-ouvertes.fr/hal-00553168

M. Geist and O. Pietquin, Kalman filtering & colored noises: the (autoregressive) moving-average case, IEEE Workshop on Machine Learning Algorithms, Systems and Applications, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00660607

M. Geist and O. Pietquin, Managing Uncertainty within the KTD Framework, Workshop on Active Learning and Experimental Design, Journal of Machine Learning Research (Conference and Workshop Proceedings), pp.157-168, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00599636

M. Geist and O. Pietquin, Parametric value function approximation: A unified view, 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp.9-16, 2011.
DOI : 10.1109/ADPRL.2011.5967355

URL : https://hal.archives-ouvertes.fr/hal-00618112

M. Geist and O. Pietquin, Algorithmic Survey of Parametric Value Function Approximation, IEEE Transactions on Neural Networks and Learning Systems, vol.24, issue.6, pp.845-867, 2013.
DOI : 10.1109/TNNLS.2013.2247418

URL : https://hal.archives-ouvertes.fr/hal-00869725

M. Geist, O. Pietquin, and G. Fricout, A Sparse Nonlinear Bayesian Online Kernel Regression, 2008 The Second International Conference on Advanced Engineering Computing and Applications in Sciences, pp.199-204, 2008.
DOI : 10.1109/ADVCOMP.2008.7

URL : https://hal.archives-ouvertes.fr/hal-00327081

M. Geist, O. Pietquin, and G. Fricout, Bayesian Reward Filtering, Recent Advances in Reinforcement Learning, pp.96-109, 2008.
DOI : 10.1145/1143844.1143955

URL : https://hal.archives-ouvertes.fr/hal-00351282

M. Geist, O. Pietquin, and G. Fricout, Filtrage bayésien de la récompense, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, pp.113-122, 2008.

M. Geist, O. Pietquin, and G. Fricout, Online Bayesian kernel regression from nonlinear mapping of observations, 2008 IEEE Workshop on Machine Learning for Signal Processing, pp.309-314, 2008.
DOI : 10.1109/MLSP.2008.4685498

URL : https://hal.archives-ouvertes.fr/hal-00335052

M. Geist, O. Pietquin, and G. Fricout, Différences Temporelles de Kalman, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2009.
DOI : 10.3166/ria.24.423-443

M. Geist, O. Pietquin, and G. Fricout, Différences Temporelles de Kalman : le cas stochastique, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2009.
DOI : 10.3166/ria.24.423-443

M. Geist, O. Pietquin, and G. Fricout, From Supervised to Reinforcement Learning: a Kernel-based Bayesian Filtering Framework, International Journal On Advances in Software, vol.2, issue.1, pp.101-116, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00429891

M. Geist, O. Pietquin, and G. Fricout, Kalman Temporal Differences: The deterministic case, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp.185-192, 2009.
DOI : 10.1109/ADPRL.2009.4927543

URL : https://hal.archives-ouvertes.fr/hal-00380870

M. Geist, O. Pietquin, and G. Fricout, Kernelizing Vector Quantization Algorithms, European Symposium on Artificial Neural Networks (ESANN 09), pp.541-546, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00429892

M. Geist, O. Pietquin, and G. Fricout, Tracking in Reinforcement Learning, International Conference on Neural Information Processing, pp.502-511, 2009.
DOI : 10.1007/978-3-642-10677-4_57

URL : https://hal.archives-ouvertes.fr/hal-00439316

M. Geist, O. Pietquin, and G. Fricout, Astuce du Noyau & Quantification Vectorielle, Colloque sur la Reconnaissance des Formes et l'Intelligence Artificielle (RFIA'10), 2010.
URL : https://hal.archives-ouvertes.fr/hal-00553114

M. Geist, O. Pietquin, and G. Fricout, Différences temporelles de Kalman : cas déterministe. Revue d'Intelligence Artificielle, pp.423-442, 2010.
DOI : 10.3166/ria.24.423-443

M. Geist and B. Scherrer, ℓ1-Penalized Projected Bellman Residual, European Workshop on Reinforcement Learning (EWRL 2011), Lecture Notes in Computer Science (LNCS), 2011.
DOI : 10.1007/978-3-642-29946-9_12

URL : http://hal.inria.fr/docs/00/64/45/07/PDF/gs_ewrl_l1_cr.pdf

M. Geist and B. Scherrer, Moindres carrés récursifs pour l'évaluation off-policy d'une politique avec traces d'éligibilité, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2011.

M. Geist and B. Scherrer, Off-policy Learning with Eligibility Traces: A Survey, Journal of Machine Learning Research, vol.15, pp.289-333, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00921275

M. Geist, B. Scherrer, A. Lazaric, and M. Ghavamzadeh, A Dantzig Selector Approach to Temporal Difference Learning, International Conference on Machine Learning (ICML), 2012.
URL : https://hal.archives-ouvertes.fr/hal-00749480

M. Geist, B. Scherrer, A. Lazaric, and M. Ghavamzadeh, Un sélecteur de Dantzig pour l'apprentissage par différences temporelles, Journées Francophones sur la Planification, la Décision et l'Apprentissage pour la conduite des systèmes (JFPDA), 2012.

H. Glaude, F. Akrimi, M. Geist, and O. Pietquin, A Non-parametric Approach to Approximate Dynamic Programming, 2011 10th International Conference on Machine Learning and Applications and Workshops, pp.317-322, 2011.
DOI : 10.1109/ICMLA.2011.19

URL : https://hal.archives-ouvertes.fr/hal-00652438

G. Gordon, Stable Function Approximation in Dynamic Programming, International Conference on Machine Learning (ICML), 1995.
DOI : 10.1016/B978-1-55860-377-6.50040-2

URL : http://www.cs.berkeley.edu/~pabbeel/cs287-fa09/readings/Gordon-1995.pdf

A. C. Graesser, P. Chipman, B. C. Haynes, and A. Olney, AutoTutor: An intelligent tutoring system with mixed-initiative dialogue, IEEE Transactions on Education, vol.48, issue.4, pp.612-618, 2005.

A. Grubb and D. Bagnell, Generalized Boosting Algorithms for Convex Optimization, International Conference on Machine Learning (ICML), pp.1209-1216, 2011.

S. Grünewälder, G. Lever, L. Baldassarre, M. Pontil, and A. Gretton, Modelling transition dynamics in MDPs with RKHS embeddings, International Conference on Machine Learning (ICML), pp.535-542, 2012.

V. Heidrich-Meisner and C. Igel, Evolution Strategies for Direct Policy Search, Parallel Problem Solving from Nature (PPSN X), pp.428-437, 2008.
DOI : 10.1007/978-3-540-87700-4_43

URL : http://www.neuroinformatik.ruhr-uni-bochum.de/thbio/members/profil/Heidrich-Meisner/H-MIppsn08.pdf

M. W. Hoffman, A. Lazaric, M. Ghavamzadeh, and R. Munos, Regularized Least Squares Temporal Difference Learning with Nested ℓ2 and ℓ1 Penalization, European Workshop on Reinforcement Learning (EWRL), 2011.
DOI : 10.1007/978-3-642-29946-9_13

A. J. Ijspeert, J. Nakanishi, and S. Schaal, Learning attractor landscapes for learning motor primitives, Advances in Neural Information Processing Systems (NIPS), pp.1523-1530, 2002.

H. Jaeger, The "echo state" approach to analyzing and training recurrent neural networks, Fraunhofer Institute for Autonomous Intelligent Systems, 2001.

S. J. Julier and J. K. Uhlmann, Unscented filtering and nonlinear estimation, Proceedings of the IEEE, pp.401-422, 2004.

L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, Planning and acting in partially observable stochastic domains, Artificial Intelligence, vol.101, issue.1-2, pp.99-134, 1998.
DOI : 10.1016/S0004-3702(98)00023-X

S. Kakade, A Natural Policy Gradient, Neural Information Processing Systems (NIPS), pp.1531-1538, 2001.

S. Kakade and J. Langford, Approximately optimal approximate reinforcement learning, International Conference on Machine Learning (ICML), pp.267-274, 2002.

M. Kearns and S. Singh, Bias-Variance Error Bounds for Temporal Difference Updates, Conference on Learning Theory (COLT), 2000.

J. F. Kelley, An iterative design methodology for user-friendly natural language office information applications, ACM Transactions on Information Systems (TOIS), vol.2, issue.1, pp.26-41, 1984.

J. Kennedy and R. Eberhart, Particle swarm optimization, Proceedings of ICNN'95, International Conference on Neural Networks, pp.1942-1948, 1995.
DOI : 10.1109/ICNN.1995.488968

B. Kim, A. M. Farahmand, J. Pineau, and D. Precup, Learning from limited demonstrations, Advances in Neural Information Processing Systems (NIPS), pp.2859-2867, 2013.

E. Klein, Contributions à l'apprentissage par renforcement inverse, Thèse de Doctorat en Informatique, 2013.
DOI : 10.3166/ria.27.155-169

URL : https://hal.archives-ouvertes.fr/tel-01303275

E. Klein, M. Geist, and O. Pietquin, Apprentissage par imitation étendu au cas batch, off-policy et sans modèle, Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes, 2011.

E. Klein, M. Geist, and O. Pietquin, Batch, Off-Policy and Model-Free Apprenticeship Learning, European Workshop on Reinforcement Learning (EWRL 2011), Lecture Notes in Computer Science (LNCS), 2011.
DOI : 10.1007/978-3-642-29946-9_28

URL : https://hal.archives-ouvertes.fr/hal-00660623

E. Klein, M. Geist, and O. Pietquin, Reducing the dimensionality of the reward space in the Inverse Reinforcement Learning problem, IEEE Workshop on Machine Learning Algorithms, Systems and Applications, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00660612

E. Klein, M. Geist, B. Piot, and O. Pietquin, Inverse Reinforcement Learning through Structured Classification, Advances in Neural Information Processing Systems, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00778624

E. Klein, B. Piot, M. Geist, and O. Pietquin, Classification structurée pour l'apprentissage par renforcement inverse, Conférence Francophone sur l'Apprentissage Automatique, 2012.
DOI : 10.3166/ria.27.155-169

E. Klein, B. Piot, M. Geist, and O. Pietquin, A Cascaded Supervised Learning Approach to Inverse Reinforcement Learning, Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013), pp.1-16, 2013.
DOI : 10.1007/978-3-642-40988-2_1

URL : https://hal.archives-ouvertes.fr/hal-00869804

E. Klein, B. Piot, M. Geist, and O. Pietquin, Apprentissage par renforcement inverse en cascadant classification et régression, Journées Francophones de Planification, Décision et Apprentissage (JFPDA), 2013.

E. Klein, B. Piot, M. Geist, and O. Pietquin, Classification structurée pour l'apprentissage par renforcement inverse, Revue d'intelligence artificielle, vol.27, issue.2, 2013.
DOI : 10.3166/ria.27.155-169

J. Kober and J. Peters, Policy search for motor primitives in robotics, Machine Learning, pp.171-203, 2011.
DOI : 10.1007/978-3-319-03194-1_4

URL : http://papers.nips.cc/paper/3545-policy-search-for-motor-primitives-in-robotics.pdf

K. R. Koedinger, J. R. Anderson, W. H. Hadley, and M. A. Mark, Intelligent Tutoring Goes To School in the Big City, International Journal of Artificial Intelligence in Education, vol.8, pp.30-43, 1997.

J. Z. Kolter, The Fixed Points of Off-Policy TD, Neural Information Processing Systems (NIPS), 2011.

J. Z. Kolter and A. Y. Ng, Regularization and Feature Selection in Least-Squares Temporal Difference Learning, International Conference on Machine Learning, 2009.
DOI : 10.1145/1553374.1553442

URL : http://www.cs.mcgill.ca/~icml2009/papers/439.pdf

M. G. Lagoudakis and R. Parr, Least-squares policy iteration, The Journal of Machine Learning Research, vol.4, pp.1107-1149, 2003.

M. G. Lagoudakis and R. Parr, Reinforcement Learning as Classification: Leveraging Modern Classifiers, International Conference on Machine Learning (ICML), pp.424-431, 2003.

S. Larsson and D. R. Traum, Information state and dialogue management in the TRINDI dialogue move engine toolkit, Natural Language Engineering, vol.6, issue.3&4, pp.323-340, 2000.
DOI : 10.1017/S1351324900002539

A. Lazaric, M. Ghavamzadeh, and R. Munos, Analysis of a classification-based policy iteration algorithm, International Conference on Machine Learning (ICML), pp.607-614, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00482065

Y. Lee, Y. Lin, and G. Wahba, Multicategory Support Vector Machines, Journal of the American Statistical Association, vol.99, issue.465, pp.67-81, 2004.
DOI : 10.1198/016214504000000098

URL : http://www.stat.wisc.edu/~wahba/ftp1/lee.lin.wahba.04.pdf

O. Lemon, K. Georgila, J. Henderson, and M. Stuttle, An ISU dialogue system exhibiting reinforcement learning of dialogue policies, Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations on, EACL '06, pp.119-122, 2006.
DOI : 10.3115/1608974.1608986

O. Lemon and O. Pietquin, Data-Driven Methods for Adaptive Spoken Dialogue Systems: Computational Learning for Conversational Interfaces, 2012.
DOI : 10.1007/978-1-4614-4803-7

URL : https://hal.archives-ouvertes.fr/hal-00756740

G. Lever, L. Baldassarre, A. Gretton, M. Pontil, and S. Grünewälder, Modelling transition dynamics in MDPs with RKHS embeddings, International Conference on Machine Learning (ICML), pp.535-542, 2012.

E. Levin, R. Pieraccini, and W. Eckert, A stochastic model of human-machine interaction for learning dialog strategies, IEEE Transactions on Speech and Audio Processing, vol.8, issue.1, pp.11-23, 2000.

L. Li, J. D. Williams, and S. Balakrishnan, Reinforcement learning for dialog management using least-squares policy iteration and fast feature selection, InterSpeech, pp.2475-2478, 2009.

M. Loth, M. Davy, and P. Preux, Sparse Temporal Difference Learning Using LASSO, 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp.352-359, 2007.
DOI : 10.1109/ADPRL.2007.368210

URL : https://hal.archives-ouvertes.fr/inria-00117075

J. Ma and W. B. Powell, Convergence Analysis of Kernel-based On-policy Approximate Policy Iteration Algorithms for Markov Decision Processes with Continuous, Multidimensional States and Actions, 2010.

H. Maei, C. Szepesvari, S. Bhatnagar, D. Precup, D. Silver et al., Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation, Advances in Neural Information Processing Systems (NIPS), pp.1204-1212, 2009.

H. R. Maei and R. S. Sutton, GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces, Conference on Artificial General Intelligence (AGI), 2010.

H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton, Toward Off-Policy Learning Control with Function Approximation, International Conference on Machine Learning (ICML), 2010.

L. Mason, J. Baxter, P. Bartlett, and M. Frean, Boosting algorithms as gradient descent in function space, Neural Information Processing Systems (NIPS), 1999.

R. Munos, Error bounds for approximate policy iteration, International Conference on Machine Learning (ICML), pp.560-567, 2003.

R. Munos, Performance Bounds in $L_p$-norm for Approximate Value Iteration, SIAM Journal on Control and Optimization, vol.46, issue.2, pp.541-561, 2007.
DOI : 10.1137/040614384

URL : http://hal.archives-ouvertes.fr/docs/00/12/46/85/PDF/avi_siam_final.pdf

R. Munos and C. Szepesvári, Finite-time bounds for fitted value iteration, The Journal of Machine Learning Research (JMLR), vol.9, pp.815-857, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00120882

T. Munzer, B. Piot, M. Geist, O. Pietquin, and M. Lopes, Inverse Reinforcement Learning in Relational Domains, International Joint Conferences on Artificial Intelligence (IJCAI), 2015.
URL : https://hal.archives-ouvertes.fr/hal-01154650

A. Nedić and D. P. Bertsekas, Least Squares Policy Evaluation Algorithms with Linear Function Approximation, Discrete Event Dynamic Systems: Theory and Applications, pp.79-110, 2003.

G. Neu and C. Szepesvári, Training parsers by inverse reinforcement learning, Machine Learning, vol.285, issue.5, pp.303-337, 2009.
DOI : 10.1017/CBO9780511546921

URL : https://link.springer.com/content/pdf/10.1007%2Fs10994-009-5110-1.pdf

A. Y. Ng, D. Harada, and S. Russell, Policy invariance under reward transformations: Theory and application to reward shaping, International Conference on Machine Learning (ICML), pp.278-287, 1999.

A. Y. Ng and S. J. Russell, Algorithms for inverse reinforcement learning, International Conference on Machine Learning (ICML), pp.663-670, 2000.

R. Niewiadomski, J. Hofmann, J. Urbain, T. Platt, J. Wagner et al., Laugh-aware virtual agent and its impact on user amusement, International Conference on Autonomous Agents and Multiagent Systems, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00869751

D. Ormoneit and Ś. Sen, Kernel-based reinforcement learning, Machine Learning, vol.49, issue.2/3, pp.161-178, 2002.
DOI : 10.1023/A:1017928328829

J. Oster, M. Geist, O. Pietquin, and G. Clifford, Filtering of pathological ventricular rhythms during MRI scanning, International Workshop on Biosignal Interpretation, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00749457

J. Peters and S. Schaal, Natural Actor-Critic, Neurocomputing, vol.71, issue.7-9, pp.1180-1190, 2008.
DOI : 10.1016/j.neucom.2007.11.026

M. Petrik, G. Taylor, R. Parr, and S. Zilberstein, Feature Selection Using Regularization in Approximate Linear Programs for Markov Decision Processes, International Conference on Machine Learning (ICML), pp.871-878, 2010.

O. Pietquin, A framework for unsupervised learning of dialogue strategies, 2004.

O. Pietquin, Consistent goal-directed user model for realistic man-machine task-oriented spoken dialogue simulation, IEEE International Conference on Multimedia and Expo, pp.425-428, 2006.
DOI : 10.1109/icme.2006.262563

URL : http://hal.archives-ouvertes.fr/docs/00/21/59/68/PDF/icme-pietquin.pdf

O. Pietquin, L. Daubigney, and M. Geist, Optimization of a Tutoring System from a Fixed Set of Data, ISCA workshop on Speech and Language Technology in Education, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00652324

O. Pietquin and T. Dutoit, A probabilistic framework for dialog simulation and optimal strategy learning, IEEE Transactions on Audio, Speech and Language Processing, vol.14, issue.2, pp.589-599, 2006.
DOI : 10.1109/TSA.2005.855836

URL : https://hal.archives-ouvertes.fr/hal-00207952

O. Pietquin, M. Geist, and S. Chandramohan, Sample Efficient On-line Learning of Optimal Dialogue Policies with Kalman Temporal Differences, International Joint Conference on Artificial Intelligence (IJCAI 2011), pp.1878-1883, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00618252

O. Pietquin and H. Hastie, A survey on metrics for the evaluation of user simulations. The knowledge engineering review, pp.59-73, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00771654

B. Piot, Apprentissage hors-ligne avec Démonstrations Expertes, Thèse de Doctorat en Informatique, 2014.

B. Piot, M. Geist, and O. Pietquin, Apprentissage par démonstrations : vaut-il la peine d'estimer une fonction de récompense ?, Journées Francophones de Planification, Décision et Apprentissage (JFPDA), 2013.

B. Piot, M. Geist, and O. Pietquin, Classification régularisée par la récompense pour l'Apprentissage par Imitation, Journées Francophones de Planification, Décision et Apprentissage (JFPDA), 2013.

B. Piot, M. Geist, and O. Pietquin, Learning from Demonstrations: Is It Worth Estimating a Reward Function?, Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013), pp.17-32, 2013.
DOI : 10.1007/978-3-642-40988-2_2

URL : https://hal.archives-ouvertes.fr/hal-00916938

B. Piot, M. Geist, and O. Pietquin, Boosted and Reward-regularized Classification for Apprenticeship Learning, 13th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2014), 2014.
URL : https://hal.archives-ouvertes.fr/hal-01107837

B. Piot, M. Geist, and O. Pietquin, Boosted Bellman Residual Minimization Handling Expert Demonstrations, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 2014.
DOI : 10.1007/978-3-662-44851-9_35

URL : https://hal.archives-ouvertes.fr/hal-01060953

B. Piot, M. Geist, and O. Pietquin, Difference of Convex Functions Programming for Reinforcement Learning, Advances in Neural Information Processing Systems, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01104419

B. Piot, M. Geist, and O. Pietquin, Méthode de minimisation du résidu de Bellman boostée qui tient compte des démonstrations expertes, Journées Francophones de Planification, Décision et Apprentissage (JFPDA), 2014.

B. Piot, M. Geist, and O. Pietquin, Imitation Learning Applied to Embodied Conversational Agents, Machine Learning and Interactive Systems (MLIS), 2015.
URL : https://hal.archives-ouvertes.fr/hal-01225816

B. Piot, O. Pietquin, and M. Geist, Predicting when to laugh with structured classification, Annual Conference of the International Speech Communication Association (InterSpeech), 2014.
URL : https://hal.archives-ouvertes.fr/hal-01104739

B. Ávila Pires and C. Szepesvári, Statistical linear estimation with penalized estimators: an application to reinforcement learning, International Conference on Machine Learning (ICML), pp.1535-1542, 2012.

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

M. L. Puterman and M. C. Shin, Modified policy iteration algorithms for discounted Markov decision problems, Management Science, vol.24, issue.11, pp.1127-1137, 1978.

V. Rieser, Bootstrapping reinforcement learning-based dialogue strategies from wizard-of-oz data, 2008.

B. D. Ripley, Stochastic Simulation, 1987.
DOI : 10.1002/9780470316726

G. A. Rummery and M. Niranjan, On-line Q-Learning using Connectionist Systems, 1994.

S. Russell, Learning agents for uncertain environments, Conference on Computational Learning Theory (COLT), pp.101-103, 1998.
DOI : 10.1145/279943.279964

URL : http://www.eecs.berkeley.edu/~russell/papers/colt98-uncertainty.pdf

R. E. Schapire and Y. Freund, Boosting: Foundations and Algorithms, 2012.

J. Schatzmann, K. Weilhammer, M. Stuttle, and S. Young, A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The knowledge engineering review, pp.97-126, 2006.

J. Schatzmann, M. N. Stuttle, K. Weilhammer, and S. Young, Effects of the user model on simulation-based learning of dialogue strategies, IEEE Workshop on Automatic Speech Recognition and Understanding, pp.220-225, 2005.
DOI : 10.1109/ASRU.2005.1566539

B. Scherrer, Approximate Policy Iteration Schemes: A Comparison, International Conference on Machine Learning (ICML), pp.1314-1322, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00989982

B. Scherrer, V. Gabillon, M. Ghavamzadeh, and M. Geist, Approximate Modified Policy Iteration, International Conference on Machine Learning (ICML), 2012.
URL : https://hal.archives-ouvertes.fr/hal-00697169

B. Scherrer, V. Gabillon, M. Ghavamzadeh, and M. Geist, Approximations de l'algorithme Itérations sur les Politiques Modifié, Journées Francophones sur la Planification, la Décision et l'Apprentissage pour la conduite des systèmes (JFPDA), 2012.

B. Scherrer and M. Geist, Recursive Least-Squares Learning with Eligibility Traces, European Workshop on Reinforcement Learning (EWRL 2011), Lecture Notes in Computer Science (LNCS), 2011.
DOI : 10.1007/978-3-642-29946-9_14

URL : https://hal.archives-ouvertes.fr/hal-00644511

B. Scherrer and M. Geist, Local Policy Search in a Convex Space and Conservative Policy Iteration as Boosted Policy Search, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 2014.
DOI : 10.1007/978-3-662-44845-8_3

URL : https://hal.archives-ouvertes.fr/hal-01091079

B. Scherrer and M. Geist, Quand l'optimalité locale implique une garantie globale : recherche locale de politique dans un espace convexe et algorithme d'itération sur les politiques conservatif vu comme une montée de gradient, Journées Francophones de Planification, Décision et Apprentissage (JFPDA), 2014.
DOI : 10.3166/ria.29.685-704

B. Scherrer, M. Ghavamzadeh, V. Gabillon, B. Lesner, and M. Geist, Approximate Modified Policy Iteration and its Application to the Game of Tetris, Journal of Machine Learning Research, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01091341

B. Scherrer and B. Lesner, On the use of non-stationary policies for stationary infinite-horizon markov decision processes, Advances in Neural Information Processing Systems (NIPS), pp.1826-1834, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00758809

O. Sigaud and O. Buffet, Markov decision processes in artificial intelligence, 2013.
DOI : 10.1002/9781118557426

URL : https://hal.archives-ouvertes.fr/inria-00432735

D. Simon, Optimal State Estimation: Kalman, H Infinity, and Nonlinear Approaches, 2006.
DOI : 10.1002/0470045345

L. Song, J. Huang, A. Smola, and K. Fukumizu, Hilbert space embeddings of conditional distributions with applications to dynamical systems, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp.961-968, 2009.
DOI : 10.1145/1553374.1553497

R. S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning, pp.9-44, 1988.

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 1998.

R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver et al., Fast gradient-descent methods for temporaldifference learning with linear function approximation, International Conference on Machine Learning (ICML), pp.993-1000, 2009.
DOI : 10.1145/1553374.1553501

URL : http://webdocs.cs.ualberta.ca/~sutton/papers/gradTD1.pdf

R. S. Sutton, A. R. Mahmood, and M. White, An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning, arXiv preprint, 2015.

R. S. Sutton, D. A. Mcallester, S. P. Singh, and Y. Mansour, Policy Gradient Methods for Reinforcement Learning with Function Approximation, Neural Information Processing Systems (NIPS), pp.1057-1063, 1999.

U. Syed and R. E. Schapire, A game-theoretic approach to apprenticeship learning, Advances in Neural Information Processing Systems (NIPS), pp.1449-1456, 2007.

U. Syed and R. E. Schapire, A reduction from apprenticeship learning to classification, Advances in Neural Information Processing Systems (NIPS), pp.2253-2261, 2010.

C. Szepesvári, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, vol.4, issue.1, pp.1-103, 2010.
DOI : 10.2200/S00268ED1V01Y201005AIM009

I. Szita, V. Gyenes, and A. Lorincz, Reinforcement Learning with Echo State Networks, International Conference on Artificial Neural Networks (ICANN), 2006.
DOI : 10.1007/11840817_86

M. Tagorti and B. Scherrer, On the Rate of Convergence and Error Bounds for LSTD(λ), International Conference on Machine Learning (ICML), 2015.
URL : https://hal.archives-ouvertes.fr/hal-01186667

Pham Dinh Tao and Le Thi Hoai An, Convex analysis approach to DC programming: Theory, algorithms and applications, Acta Mathematica Vietnamica, vol.22, issue.1, pp.289-355, 1997.

Pham Dinh Tao and Le Thi Hoai An, The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems, Annals of Operations Research, vol.133, 2005.

B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin, Learning structured prediction models, Proceedings of the 22nd international conference on Machine learning , ICML '05, pp.896-903, 2005.
DOI : 10.1145/1102351.1102464

G. Taylor and R. Parr, Value function approximation in noisy environments using locally smoothed regularized approximate linear programs, Uncertainty in Artificial Intelligence (UAI), 2012.

P. Thomas, G. Theocharous, and M. Ghavamzadeh, High confidence off-policy evaluation, Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

R. Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, vol.58, issue.1, pp.267-288, 1996.

J. N. Tsitsiklis and B. Van-roy, An analysis of temporal-difference learning with function approximation, IEEE Transactions on Automatic Control, vol.42, issue.5, pp.674-690, 1997.
DOI : 10.1109/9.580874

R. van der Merwe, Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-Space Models, 2004.

V. N. Vapnik, Statistical Learning Theory, 1998.

S. Young, M. Gašić, S. Keizer, F. Mairesse, J. Schatzmann et al., The Hidden Information State model: A practical framework for POMDP-based spoken dialogue management, Computer Speech & Language, vol.24, issue.2, pp.150-174, 2010.
DOI : 10.1016/j.csl.2009.04.001

URL : https://hal.archives-ouvertes.fr/hal-00598186

S. Young, J. Schatzmann, K. Weilhammer, and H. Ye, The Hidden Information State Approach to Dialog Management, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '07, 2007.
DOI : 10.1109/ICASSP.2007.367185

H. Yu, Convergence of least-squares temporal difference methods under general conditions, International Conference on Machine Learning (ICML), 2010.
DOI : 10.1137/100807879

H. Yu and D. P. Bertsekas, Q-Learning Algorithms for Optimal Stopping Based on Least Squares, European Control Conference, 2007.

B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, Maximum Entropy Inverse Reinforcement Learning, AAAI Conference on Artificial Intelligence, pp.1433-1438, 2008.

H. Zou and T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol.67, issue.2, pp.301-320, 2005.