Conference paper, Year: 2004

A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations

Abstract

A new kind of application has emerged: code coupling applications, which are applications that can be divided into modules. They often need to run on several clusters. However, these huge architectures, which we call "cluster federations", contain a large number of nodes, so faults may occur very frequently. A fault tolerance mechanism that fits these architectures and this kind of application should therefore be provided. We propose a hierarchical checkpointing protocol that combines synchronized methods inside clusters with communication-induced methods between clusters. Our protocol has been evaluated by discrete event simulation. The first results show that it works well for the targeted applications.
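
The combination described in the abstract can be illustrated with a small sketch. It is not the protocol from the paper: it assumes one common communication-induced rule, in which each inter-cluster message carries the sender's checkpoint sequence number and a receiver that is behind takes a forced checkpoint before delivery. The class `Cluster`, the field `sn`, and all method names are hypothetical, and the synchronized checkpoint inside a cluster is reduced to a single counter update.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model of one cluster in a federation (illustration only)."""
    name: str
    sn: int = 0                                # sequence number of the latest cluster checkpoint
    log: list = field(default_factory=list)    # history of (sn, kind) checkpoints

    def coordinated_checkpoint(self, sn=None, kind="local"):
        # Inside a cluster, all nodes checkpoint together (synchronized method).
        # A real system would run an agreement phase among the nodes here;
        # this sketch only records the resulting cluster-wide checkpoint.
        self.sn = self.sn + 1 if sn is None else sn
        self.log.append((self.sn, kind))

    def send(self, dest, payload):
        # Assumption: inter-cluster messages piggyback the sender's sequence number.
        return dest.receive(self.sn, payload)

    def receive(self, sender_sn, payload):
        # Assumed communication-induced rule: if the message was sent after a
        # checkpoint newer than ours, checkpoint before delivering it.
        if sender_sn > self.sn:
            self.coordinated_checkpoint(sn=sender_sn, kind="forced")
        return payload


a, b = Cluster("A"), Cluster("B")
a.coordinated_checkpoint()   # cluster A checkpoints on its own schedule
a.send(b, "result")          # B is behind, so it takes a forced checkpoint
b.send(a, "ack")             # A is already up to date, no checkpoint needed
```

In this sketch, clusters that never exchange messages checkpoint independently, and only inter-cluster communication triggers extra checkpoints.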
Main file
MonMorBad04FTPDS.pdf (84.62 KB)

Dates and versions

inria-00000990, version 1 (10-01-2006)
inria-00000990, version 2 (10-01-2006)

Identifiers

  • HAL Id: inria-00000990, version 2

Cite

Sébastien Monnet, Christine Morin, Ramamurthy Badrinath. A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations. 9th IEEE Workshop on Fault-Tolerant Parallel, Distributed and Network-Centric Systems, Apr 2004, Santa Fe, New Mexico, United States. pp.211. ⟨inria-00000990v2⟩
