G. Amdahl, The validity of the single processor approach to achieving large scale computing capabilities, AFIPS Conference Proceedings, vol.30, pp.483-485, 1967.

G. Aupy, Y. Robert, and F. Vivien, Assuming failure independence: are we right to be wrong?, FTS'2017, the Workshop on Fault-Tolerant Systems, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01654639

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI: High performance fault tolerance interface for hybrid systems, Proc. SC'11, 2011.
URL : https://hal.archives-ouvertes.fr/hal-01298430

A. Benoit, A. Cavelan, F. Cappello, P. Raghavan, Y. Robert et al., Coping with silent and fail-stop errors at scale by combining replication and checkpointing, J. Parallel and Distributed Computing, vol.122, pp.209-225, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02082389

A. Benoit, A. Cavelan, V. Le Fèvre, Y. Robert, and H. Sun, Towards optimal multi-level checkpointing, IEEE Trans. Computers, vol.66, issue.7, pp.1212-1226, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01339788

A. Benoit, T. Herault, V. Le Fèvre, and Y. Robert, Replication is more efficient than you think: Code and technical report, 2019.

W. Bland, A. Bouteiller, T. Herault, J. Hursey, G. Bosilca et al., An evaluation of User-Level Failure Mitigation support in MPI, Computing, vol.95, issue.12, pp.1171-1184, 2013.

R. Brightwell, K. Ferreira, and R. Riesen, Transparent redundant computing with MPI, 2010.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward exascale resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014.

F. Cappello and K. Mohror, VeloC: very low overhead checkpointing system, 2019.

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Computer Systems, vol.51, pp.7-19, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01199752

S. P. Crago, D. I. Kang, M. Kang, R. Kost, K. Singh et al., Programming models and development software for a space-based many-core processor, 4th Int. Conf. on Space Mission Challenges for Information Technology, pp.95-102, 2011.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.

S. Di, M. S. Bouguerra, L. Bautista-Gomez, and F. Cappello, Optimization of multi-level checkpoint model for large scale HPC applications, 2014.

M. E. Diouri, O. Glück, L. Lefevre, and F. Cappello, Energy considerations in checkpointing and fault tolerance protocols, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp.1-6, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00748006

N. El-Sayed and B. Schroeder, Reading between the lines of failure logs: Understanding how HPC systems fail, 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp.1-12, 2013.

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining partial redundancy and checkpointing for HPC, ICDCS, 2012.

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, 2009.

C. Engelmann and S. Böhm, Redundant execution of HPC applications with MR-MPI, PDCN, IASTED, 2011.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the Viability of Process Replication Reliability for Exascale Systems, SC'11, 2011.

P. Flajolet, P. J. Grabner, P. Kirschenhofer, and H. Prodinger, On Ramanujan's Q-Function, J. Computational and Applied Mathematics, vol.58, pp.103-116, 1995.

C. George and S. S. Vadhiyar, ADFT: An adaptive framework for fault tolerance on large scale systems using application malleability, Procedia Computer Science, vol.9, pp.166-175, 2012.

R. Guerraoui and A. Schiper, Fault-tolerance by replication in distributed systems, Reliable Software Technologies - Ada-Europe '96, pp.38-57, 1996.

T. Herault and Y. Robert, Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015.

Z. Hussain, T. Znati, and R. Melhem, Partial redundancy in HPC systems with non-uniform node reliabilities, SC '18, 2018.

D. Kondo, B. Javadi, A. Iosup, and D. Epema, The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems, Cluster Computing and the Grid, IEEE International Symposium on, pp.398-407, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00433523

LANL, Computer failure data repository, 2006.

T. Leblanc, R. Anand, E. Gabriel, and J. Subhlok, VolpexMPI: An MPI library for execution of parallel applications on volatile nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009.

A. Moody, G. Bronevetsky, K. Mohror, and B. R. Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, 2010.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, Proc. SC'13, 2013.

X. Ni, E. Meneses, and L. V. Kalé, Hiding checkpoint overhead in HPC applications with a semi-blocking algorithm, Cluster Computing (CLUSTER), 2012 IEEE International Conference on, pp.364-372, 2012.

R. Oldfield, S. Arunagiri, P. Teller, S. Seelam, M. Varela et al., Modeling the impact of checkpoints on next-generation systems, Proc. of IEEE MSST, pp.30-46, 2007.

M. W. Rashid and M. C. Huang, Supporting highly-decoupled thread-level redundancy for parallel programs, 14th Int. Conf. on High-Performance Computer Architecture (HPCA), pp.393-404, 2008.

R. Riesen, K. Ferreira, and J. Stearley, See applications run and throughput jump: The case for redundant computing in HPC, Proc. of the Dependable Systems and Networks Workshops, pp.29-34, 2010.

B. Schroeder and G. A. Gibson, Understanding Failures in Petascale Computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.

J. Stearley, K. B. Ferreira, D. J. Robinson, J. Laros, K. T. Pedretti et al., Does partial replication pay off?, FTXS, 2012.

O. Subasi, J. Arias, O. Unsal, J. Labarta, and A. Cristal, Programmer-directed partial redundancy for resilient HPC, Computing Frontiers, 2015.

Top500, Top 500 Supercomputer Sites, 2018.

E. Weisstein, Gauss hypergeometric function, from MathWorld, a Wolfram Web Resource.

E. Weisstein, Incomplete Beta Function, from MathWorld, a Wolfram Web Resource.

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using Replication and Checkpointing for Reliable Task Management in Computational Grids, SC'10, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00788867

J. W. Young, A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974.

J. Yu, D. Jian, Z. Wu, and H. Liu, Thread-level redundancy fault tolerant CMP based on relaxed input replication, 2011.

G. Zheng, L. Shi, and L. V. Kalé, FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, IEEE International Conference on, pp.93-103, 2004.

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, Cluster Computing, 2009.