Unified Model for Assessing Checkpointing Protocols at Extreme-Scale - HAL open archive
Journal article, Concurrency and Computation: Practice and Experience, Year: 2013

Unified Model for Assessing Checkpointing Protocols at Extreme-Scale

George Bosilca
  • Role: Author
  • PersonId : 863939
Elisabeth Brunet
Jack Dongarra
  • Role: Author
  • PersonId : 863940
Thomas Herault
  • Role: Author
  • PersonId : 833735

Abstract

In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.
Main file
concurrency-revised.pdf (589.69 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00908447, version 1 (23-11-2013)

Identifiers

  • HAL Id: hal-00908447
  • DOI: 10.1002/cpe.3173

Cite

George Bosilca, Aurélien Bouteiller, Elisabeth Brunet, Franck Cappello, Jack Dongarra, et al. Unified Model for Assessing Checkpointing Protocols at Extreme-Scale. Concurrency and Computation: Practice and Experience, 2013, 26 (17), pp. 2727-2810. ⟨10.1002/cpe.3173⟩. ⟨hal-00908447⟩