J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert et al., The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

V. Sarkar, W. Harrod, and A. Snavely, Software challenges in extreme scale systems, Journal of Physics: Conference Series, vol.180, issue.1, 2009.
DOI : 10.1088/1742-6596/180/1/012045

E. Elnozahy and J. Plank, Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.
DOI : 10.1109/TDSC.2004.15

B. Schroeder and G. A. Gibson, Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.
DOI : 10.1088/1742-6596/78/1/012022

G. Zheng, X. Ni, and L. Kale, A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), 2012.
DOI : 10.1109/DSNW.2012.6264677

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289177

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, Proc. of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pp.189-194, 2009.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

W. Jones, J. Daly, and N. Debardeleben, Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters, Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pp.276-279, 2010.
DOI : 10.1145/1851476.1851509

K. Venkatesh, Analysis of Dependencies of Checkpoint Cost and Checkpoint Interval of Fault Tolerant MPI Applications, Analysis, vol.2, issue.08, pp.2690-2697, 2010.

M. Bouguerra, T. Gautier, D. Trystram, and J. Vincent, A Flexible Checkpoint/Restart Model in Distributed Systems, LNCS, vol.6067, pp.206-215, 2010.
DOI : 10.1007/978-3-642-14390-8_22
URL : https://hal.archives-ouvertes.fr/hal-00788926

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504

T. Heath, R. P. Martin, and T. D. Nguyen, Improving cluster availability using workstation validation, ACM SIGMETRICS Performance Evaluation Review, vol.30, issue.1, pp.217-227, 2002.
DOI : 10.1145/511399.511362

B. Schroeder and G. A. Gibson, A large-scale study of failures in high-performance computing systems, Proc. of DSN, pp.249-258, 2006.

Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun et al., An optimal checkpoint/restart model for a large scale high performance computing system, IPDPS, pp.1-9, 2008.

T. J. Hacker, F. Romero, and C. D. Carothers, An analysis of clustered failures on large supercomputing systems, Journal of Parallel and Distributed Computing, vol.69, issue.7, pp.652-665, 2009.
DOI : 10.1016/j.jpdc.2009.03.007

F. Gärtner, Fundamentals of fault-tolerant distributed computing in asynchronous environments, ACM Computing Surveys, vol.31, issue.1, pp.1-26, 1999.
DOI : 10.1145/311531.311532

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using replication and checkpointing for reliable task management in computational Grids, 2010 International Conference on High Performance Computing & Simulation, 2010.
DOI : 10.1109/HPCS.2010.5547140
URL : https://hal.archives-ouvertes.fr/hal-00788867

C. Engelmann and S. Böhm, Redundant Execution of HPC Applications with MR-MPI, Parallel and Distributed Computing and Networks / 720: Software Engineering, 2011.
DOI : 10.2316/P.2011.719-031

T. Leblanc, R. Anand, E. Gabriel, and J. Subhlok, VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009.
DOI : 10.1007/978-3-642-03770-2_19

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012.
DOI : 10.1109/ICDCS.2012.56

J. Stearley, K. B. Ferreira, D. J. Robinson, J. Laros, K. T. Pedretti et al., Does partial replication pay off?, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), FTXS workshop, 2012.
DOI : 10.1109/DSNW.2012.6264669

C. George and S. S. Vadhiyar, ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability, Procedia Computer Science, vol.9, pp.166-175, 2012.
DOI : 10.1016/j.procs.2012.04.018

Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, Software rejuvenation: Analysis, module and applications, Proc. of FTCS '95, IEEE CS, p.381, 1995.

V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi et al., Proactive management of software aging, IBM Journal of Research and Development, vol.45, issue.2, pp.311-332, 2001.
DOI : 10.1147/rd.452.0311

L. Wang, P. Karthik, Z. Kalbarczyk, R. Iyer, L. Votta et al., Modeling Coordinated Checkpointing for Large-Scale Supercomputers, 2005 International Conference on Dependable Systems and Networks (DSN'05), pp.812-821, 2005.
DOI : 10.1109/DSN.2005.67

A. Guermouche, T. Ropars, M. Snir, and F. Cappello, HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.111
URL : https://hal.archives-ouvertes.fr/hal-01121941

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proc. of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2012.

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, Combining process replication and checkpointing for resilience on exascale systems, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00697180

P. Flajolet, P. J. Grabner, P. Kirschenhofer, and H. Prodinger, On Ramanujan's Q-function, Journal of Computational and Applied Mathematics, vol.58, issue.1, pp.103-116, 1995.
DOI : 10.1016/0377-0427(93)E0258-N

R. Riesen, K. Ferreira, and J. Stearley, See applications run and throughput jump: The case for redundant computing in HPC, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.29-34, 2010.
DOI : 10.1109/DSNW.2010.5542625

D. Kondo, B. Javadi, A. Iosup, and D. Epema, The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
DOI : 10.1109/CCGRID.2010.71
URL : https://hal.archives-ouvertes.fr/inria-00433523

M. Bougeret, H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, Using group replication for resilience on exascale systems, International Journal of High Performance Computing Applications, vol.28, issue.2, 2014.
DOI : 10.1177/1094342013505348
URL : https://hal.archives-ouvertes.fr/hal-00881463