E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief et al., QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, IPDPS'11, 25th IEEE International Parallel & Distributed Processing Symposium, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00547614

E. Agullo, O. Aumage, M. Faverge, N. Furmento, F. Pruvost et al., Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model, IEEE Transactions on Parallel and Distributed Systems (TPDS), 2017.
URL : https://hal.archives-ouvertes.fr/hal-01618526

L. Alvisi and K. Marzullo, Message logging: pessimistic, optimistic, causal, and optimal, IEEE Transactions on Software Engineering, vol.24, 1998.

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI: High performance fault tolerance interface for hybrid systems, SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00721216

W. Bland, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra, Post-failure recovery of MPI communication capability: Design and rationale, The International Journal of High Performance Computing Applications, vol.27, issue.3, 2013.

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Hérault et al., PaRSEC: A programming paradigm exploiting heterogeneity for enhancing scalability, Computing in Science and Engineering, vol.15, issue.6, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00930217

A. Bouteiller, G. Bosilca, and J. Dongarra, Redesigning the message logging model for high performance, Concurrency and Computation: Practice and Experience, vol.22, issue.16, 2010.

S. Di, Y. Robert, F. Vivien, and F. Cappello, Toward an optimal online checkpoint solution under a two-level HPC checkpoint model, IEEE Transactions on Parallel and Distributed Systems, vol.28, issue.1, pp.244-259, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01353871

J. Dongarra and D. Walker, Software libraries for linear algebra computations on high performance computers, SIAM Review, vol.37, issue.2, 1995.

J. Dongarra, T. Herault, and Y. Robert, Revisiting the Double Checkpointing Algorithm, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00768491

E. N. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson, A Survey of Rollback-recovery Protocols in Message-passing Systems, ACM Comput. Surv., vol.34, issue.3, pp.375-408, 2002.

S. Gupta, T. Patel, C. Engelmann, and D. Tiwari, Failures in large scale systems: long-term measurement, analysis, and implications, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17), 2017.

N. Losada, G. Bosilca, A. Bouteiller, P. González, and M. J. Martín, Local rollback for resilient MPI applications with application-level checkpointing and message logging, Future Generation Computer Systems, vol.91, pp.450-464, 2019.

B. Nicolae, A. Moody, E. Gonsiorowski, K. Mohror, and F. Cappello, VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019.
URL : https://hal.archives-ouvertes.fr/hal-02184203

R. Strom and S. Yemini, Optimistic recovery in distributed systems, ACM Transactions on Computer Systems (TOCS), vol.3, issue.3, pp.204-226, 1985.

S. Blackford, The Two-dimensional Block-Cyclic Distribution, 1997.

F. Tessier, V. Vishwanath, and E. Jeannot, TAPIOCA: An I/O Library for Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers, CLUSTER 2017, IEEE International Conference on Cluster Computing, pp.1-11, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01621344

S. Thibault, On Runtime Systems for Task-based Programming on Heterogeneous Platforms, Habilitation à diriger des recherches, 2018.
URL : https://hal.archives-ouvertes.fr/tel-01959127

M. Vasavada, F. Mueller, P. H. Hargrove, and E. Roman, Comparing different approaches for Incremental Checkpointing: The Showdown, Ottawa Linux Symposium, 2011.

S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gómez, C. Tenllado et al., Polyhedral parallel code generation for CUDA, ACM Trans. Archit. Code Optim., vol.9, issue.4, pp.1-54, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00786677

J. W. Young, A first order approximation to the optimum checkpoint interval, Commun. ACM, vol.17, issue.9, pp.530-531, 1974.