Improving Performance via Mini-applications, Research Report, vol.5574, 2009.
Speedup-Aware Co-Schedules for Efficient Workload Management, Parallel Processing Letters, vol.25, issue.02
DOI : 10.1006/jcph.1995.1039
Co-scheduling algorithms for high-throughput workload execution
DOI : 10.1007/s10951-015-0445-x
URL : https://hal.archives-ouvertes.fr/hal-00819036
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016
Scheduling multiprocessor tasks to minimize schedule length, IEEE Transactions on Computers, vol.C-35, issue.5, pp.389-393, 1986.
DOI : 10.1109/tc.1986.1676781
Complexity of Scheduling Parallel Task Systems, SIAM Journal on Discrete Mathematics, vol.2, issue.4, pp.473-487, 1989.
DOI : 10.1137/0402042
Approximation Algorithms for Scheduling Independent Malleable Tasks, Euro-Par, 2001.
DOI : 10.1007/3-540-44681-8_29
The Implementation of the Cilk-5 Multithreaded Language, Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI '98, pp.212-223, 1998.
Enhancing the performance of malleable MPI applications by using performance-aware dynamic reconfiguration, Parallel Computing, vol.46, pp.60-77, 2015.
DOI : 10.1016/j.parco.2015.04.003
Detection and correction of silent data corruption for large-scale high-performance computing, Proceedings of SC'12, pp.78:1-78:12, 2012.
Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm, 2012 IEEE International Conference on Cluster Computing, pp.364-372, 2012.
DOI : 10.1109/CLUSTER.2012.82
Performance and reliability trade-offs for the double checkpointing algorithm, International Journal of Networking and Computing, vol.4, issue.1, pp.23-41, 2014.
DOI : 10.15803/ijnc.4.1_23
URL : https://hal.archives-ouvertes.fr/hal-01091928
Batch Resizing Policies and Techniques for Fine-Grain Grid Tasks: The Nuts and Bolts, Journal of Information Processing Systems, vol.7, issue.2
DOI : 10.3745/JIPS.2011.7.2.299
Fault-Tolerance Techniques for High-Performance Computing, 2015.
DOI : 10.1007/978-3-319-20943-2
URL : https://hal.archives-ouvertes.fr/hal-01200479
Efficient collective communication in distributed heterogeneous systems, JPDC, vol.63, issue.3, pp.251-263, 2003.
DOI : 10.1109/icdcs.1999.776502
Resilient application co-scheduling with processor redistribution, Research report RR-8795, INRIA, available at graal, 2015.
DOI : 10.1109/icpp.2016.21
Checkpointing strategies for parallel jobs, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/inria-00560582
Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, vol.9, issue.16, pp.2772-2791, 2014.
DOI : 10.1109/SNAPI.2010.10
URL : https://hal.archives-ouvertes.fr/hal-00696154