QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, IPDPS'11 -25th IEEE International Parallel & Distributed Processing Symposium, 2011. ,
URL : https://hal.archives-ouvertes.fr/inria-00547614
Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model, TPDS -IEEE Transactions on Parallel and Distributed Systems, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01618526
Message logging: pessimistic, optimistic, causal, and optimal, conference Name: IEEE Transactions on Software Engineering, vol.24, 1998. ,
Fti: High performance fault tolerance interface for hybrid systems, SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011. ,
URL : https://hal.archives-ouvertes.fr/hal-00721216
Postfailure recovery of MPI communication capability: Design and rationale, The International Journal of High Performance Computing Applications, vol.27, issue.3, 2013. ,
PaRSEC: A programming paradigm exploiting heterogeneity for enhancing scalability, Computing in Science and Engineering, vol.15, issue.6, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00930217
Redesigning the message logging model for high performance, Concurrency and Computation: Practice and Experience, vol.22, issue.16, 2010. ,
Toward an optimal online checkpoint solution under a two-level hpc checkpoint model, IEEE Transactions on Parallel and Distributed Systems, vol.28, issue.1, pp.244-259, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01353871
Software libraries for linear algebra computations on high performance computers, SIAM Review, vol.37, issue.2, 1995. ,
Revisiting the Double Checkpointing Algorithm, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Pchd Forum, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00768491
A Survey of Rollback-recovery Protocols in Message-passing Systems, ACM Comput. Surv, vol.34, issue.3, pp.375-408, 2002. ,
Failures in large scale systems: long-term measurement, analysis, and implications, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on -SC '17, 2017. ,
Local rollback for resilient MPI applications with application-level checkpointing and message logging, Future Generation Computer Systems, vol.91, pp.450-464, 2019. ,
VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.1530-2075, 2019. ,
URL : https://hal.archives-ouvertes.fr/hal-02184203
Optimistic recovery in distributed systems, ACM Transactions on Computer Systems (TOCS), vol.3, issue.3, pp.204-226, 1985. ,
The Two-dimensional Block-Cyclic Distribution, 1997. ,
TAPIOCA: An I/O Library for Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers, CLUSTER 2017 -IEEE International Conference on Cluster Computing, pp.1-11, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01621344
On Runtime Systems for Task-based Programming on Heterogeneous Platforms. Habilitationà diriger des recherches, 2018. ,
URL : https://hal.archives-ouvertes.fr/tel-01959127
Comparing different approaches for Incremental Checkpointing: The Showdown, Ottawa Linux Symposium, 2011. ,
Polyhedral parallel code generation for cuda, ACM Trans. Archit. Code Optim, vol.9, issue.4, pp.1-54, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00786677
A first order approximation to the optimum checkpoint interval, Commun. ACM, vol.17, issue.9, pp.530-531, 1974. ,