The validity of the single processor approach to achieving large scale computing capabilities, AFIPS Conference Proceedings, vol.30, pp.483-485, 1967. ,
Assuming failure independence: are we right to be wrong, FTS'2017, the Workshop on Fault-Tolerant Systems, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01654639
FTI: High performance fault tolerance interface for hybrid systems, Proc. SC'11, 2011. ,
URL : https://hal.archives-ouvertes.fr/hal-01298430
Coping with silent and fail-stop errors at scale by combining replication and checkpointing, J. Parallel and Distributed Computing, vol.122, pp.209-225, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-02082389
Towards optimal multi-level checkpointing, IEEE Trans. Computers, vol.66, issue.7, pp.1212-1226, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01339788
Replication is more efficient than you think: Code and technical report, 2019. ,
An evaluation of User-Level Failure Mitigation support in MPI, Computing, vol.95, issue.12, pp.1171-1184, 2013. ,
Transparent redundant computing with MPI, 2010. ,
Toward exascale resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014. ,
VeloC: very low overhead checkpointing system, 2019. ,
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Computer Systems, vol.51, pp.7-19, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01199752
Programming models and development software for a space-based many-core processor, 4th Int. Conf. on Space Mission Challenges for Information Technology, pp.95-102, 2011. ,
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006. ,
Optimization of multi-level checkpoint model for large scale HPC applications, 2014. ,
Energy considerations in checkpointing and fault tolerance protocols, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp.1-6, 2012. ,
URL : https://hal.archives-ouvertes.fr/hal-00748006
Reading between the lines of failure logs: Understanding how HPC systems fail, 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp.1-12, 2013. ,
Combining partial redundancy and checkpointing for HPC, ICDCS, 2012. ,
The case for modular redundancy in large-scale high performance computing systems, 2009. ,
Redundant execution of HPC applications with MR-MPI, PDCN. IASTED, 2011. ,
Evaluating the Viability of Process Replication Reliability for Exascale Systems, SC'11, 2011. ,
On Ramanujan's Q-Function, J. Computational and Applied Mathematics, vol.58, pp.103-116, 1995. ,
ADFT: An adaptive framework for fault tolerance on large scale systems using application malleability, Procedia Computer Science, vol.9, pp.166-175, 2012. ,
Fault-tolerance by replication in distributed systems, Reliable Software Technologies - Ada-Europe '96, pp.38-57, 1996. ,
Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015. ,
Partial redundancy in HPC systems with non-uniform node reliabilities, SC '18, 2018. ,
The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems, Cluster Computing and the Grid, IEEE International Symposium on, pp.398-407, 2010. ,
URL : https://hal.archives-ouvertes.fr/inria-00433523
Computer failure data repository, 2006. ,
VolpexMPI: An MPI library for execution of parallel applications on volatile nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009. ,
Design, modeling, and evaluation of a scalable multi-level checkpointing system, 2010. ,
ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, Proc. SC'13, 2013. ,
Hiding checkpoint overhead in HPC applications with a semi-blocking algorithm, Cluster Computing (CLUSTER), 2012 IEEE International Conference on, pp.364-372, 2012. ,
Modeling the impact of checkpoints on next-generation systems, Proc. of IEEE MSST, pp.30-46, 2007. ,
Supporting highly-decoupled thread-level redundancy for parallel programs, 14th Int. Conf. on High-Performance Computer Architecture (HPCA), pp.393-404, 2008. ,
See applications run and throughput jump: The case for redundant computing in HPC, Proc. of the Dependable Systems and Networks Workshops, pp.29-34, 2010. ,
Understanding Failures in Petascale Computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007. ,
Does partial replication pay off?, FTXS, 2012. ,
Programmer-directed partial redundancy for resilient HPC, Computing Frontiers, 2015. ,
Top 500 Supercomputer Sites, 2018. ,
Gauss hypergeometric function. From MathWorld - A Wolfram Web Resource ,
Incomplete Beta Function. From MathWorld - A Wolfram Web Resource ,
Using Replication and Checkpointing for Reliable Task Management in Computational Grids, SC'10, 2010. ,
URL : https://hal.archives-ouvertes.fr/hal-00788867
A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
Thread-level redundancy fault tolerant CMP based on relaxed input replication, 2011. ,
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, IEEE International Conference on, pp.93-103, 2004. ,
Reliability-aware scalability models for high performance computing, Cluster Computing, 2009. ,