Z. Zou, A. Proutiere, and M. Johansson, Online shortest path routing: The value of information, Proceedings of American Control Conference (ACC), 2014.
URL : https://hal.archives-ouvertes.fr/hal-00920068

B. Awerbuch and R. D. Kleinberg, Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches, Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pp.45-53, 2004.

A. György and G. Ottucsák, Adaptive routing using expert advice, The Computer Journal, vol.49, issue.2, pp.180-189, 2006.

A. György, T. Linder, G. Lugosi, and G. Ottucsák, The on-line shortest path problem under partial monitoring, Journal of Machine Learning Research, vol.8, pp.2369-2403, 2007.

T. He, D. Goeckel, R. Raghavendra, and D. Towsley, Endhost-based shortest path routing in dynamic networks, Proceedings of the 32nd IEEE International Conference on Computer Communications (INFOCOM), pp.2202-2210, 2013.

O. Brun, L. Wang, and E. Gelenbe, Big data for autonomic intercontinental overlays, IEEE Journal on Selected Areas in Communications, vol.34, issue.3, pp.575-583, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01461990

N. Cesa-bianchi and G. Lugosi, Combinatorial bandits, Journal of Computer and System Sciences, vol.78, issue.5, pp.1404-1422, 2012.

P. Auer, N. Cesa-bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, vol.47, pp.235-256, 2002.

W. Chen, Y. Wang, and Y. Yuan, Combinatorial multi-armed bandit: General framework and applications, Proceedings of the 30th International Conference on Machine Learning (ICML), pp.151-159, 2013.

A. Gopalan, S. Mannor, and Y. Mansour, Thompson sampling for complex online problems, Proceedings of the 31st International Conference on Machine Learning (ICML), pp.100-108, 2014.

H. Robbins, Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society, vol.58, issue.5, pp.527-535, 1952.

T. L. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in applied mathematics, vol.6, issue.1, pp.4-22, 1985.

J. Audibert, S. Bubeck, and G. Lugosi, Regret in online combinatorial optimization, Mathematics of Operations Research, vol.39, issue.1, pp.31-45, 2014.

S. Bubeck, N. Cesa-bianchi, and S. M. Kakade, Towards minimax policies for online linear optimization with bandit feedback, Proceedings of the 25th Conference On Learning Theory (COLT), 2012.

G. Neu and G. Bartók, An efficient algorithm for learning with semibandit feedback, Algorithmic Learning Theory, pp.234-248, 2013.

Y. Gai, B. Krishnamachari, and R. Jain, Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations, IEEE/ACM Transactions on Networking, vol.20, issue.5, pp.1466-1478, 2012.

B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari, Tight regret bounds for stochastic combinatorial semi-bandits, Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.

R. Combes, M. S. Talebi, A. Proutiere, and M. Lelarge, Combinatorial bandits revisited, Advances in Neural Information Processing Systems (NIPS), 2015.
URL : https://hal.archives-ouvertes.fr/hal-01257796

V. Anantharam, P. Varaiya, and J. Walrand, Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-Part I: IID rewards, IEEE Transactions on Automatic Control, vol.32, issue.11, pp.968-976, 1987.

B. Kveton, Z. Wen, A. Ashkan, H. Eydgahi, and B. Eriksson, Matroid bandits: Fast combinatorial optimization with learning, Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI), 2014.

Y. Gai, B. Krishnamachari, and R. Jain, Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation, Proceedings of Symposium on New Frontiers in Dynamic Spectrum (DySPAN), 2010.

Z. Wen, B. Kveton, and A. Ashkan, Efficient learning in large-scale combinatorial semi-bandits, Proceedings of the 32nd International Conference on Machine Learning (ICML), pp.1113-1122, 2015.

W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, vol.25, issue.3/4, pp.285-294, 1933.

K. Liu and Q. Zhao, Adaptive shortest-path routing under unknown and stochastically varying link states, Proceedings of the 10th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt), pp.232-237, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00763780

P. Tehrani and Q. Zhao, Distributed online learning of the shortest path under unknown random edge weights, Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing, pp.3138-3142, 2013.

A. N. Burnetas and M. N. Katehakis, Optimal adaptive policies for Markov decision processes, Mathematics of Operations Research, vol.22, issue.1, pp.222-255, 1997.

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 2005.

T. Jaksch, R. Ortner, and P. Auer, Near-optimal regret bounds for reinforcement learning, The Journal of Machine Learning Research, vol.99, pp.1563-1600, 2010.

S. Filippi, O. Cappé, and A. Garivier, Optimism in reinforcement learning and Kullback-Leibler divergence, Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing, pp.115-122, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00476116

T. L. Graves and T. L. Lai, Asymptotically efficient adaptive choice of control laws in controlled Markov chains, SIAM Journal on Control and Optimization, vol.35, issue.3, pp.715-743, 1997.

A. Sen and N. Balakrishnan, Convolution of geometrics and a reliability problem, Statistics & Probability Letters, vol.43, issue.4, pp.421-426, 1999.

A. Shapiro, Semi-infinite programming, duality, discretization and optimality conditions, Optimization, vol.58, issue.2, pp.133-161, 2009.

A. Garivier and E. Moulines, On upper-confidence bound policies for non-stationary bandit problems, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00281392

R. Combes and A. Proutiere, Unimodal bandits: Regret lower bounds and optimal algorithms, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01092662

P. Joulani, A. György, and C. Szepesvári, Online learning under delayed feedback, Proceedings of the 30th International Conference on Machine Learning (ICML), pp.1453-1461, 2013.

S. Magureanu, R. Combes, and A. Proutiere, Lipschitz bandits: Regret lower bounds and optimal algorithms, Proceedings of the 27th Conference on Learning Theory (COLT), 2014.
URL : https://hal.archives-ouvertes.fr/hal-01092791

A. Garivier, P. Ménard, and G. Stoltz, Explore first, exploit next: The true shape of regret in bandit problems, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01276324

A. Garivier and O. Cappé, The KL-UCB algorithm for bounded stochastic bandits and beyond, Proceedings of the 24th Conference On Learning Theory (COLT), 2011.