Using replication and checkpointing for reliable task management in computational Grids

Sangho Yi; Derrick Kondo; Bongjae Kim; Geunyoung Park; Yookun Cho

doi:10.1109/HPCS.2010.5547140

Conference paper, Year: 2010


Sangho Yi

Role: Author

Middleware efficiently scalable

Derrick Kondo

Role: Author

Middleware efficiently scalable

Bongjae Kim

Role: Author

Seoul National University [Seoul]

Geunyoung Park

Role: Author

Seoul National University [Seoul]

Yookun Cho

Role: Author

Seoul National University [Seoul]

Abstract

In large-scale Grid computing environments, fault tolerance is required to make both scientific computation and file sharing reliable. Several fault-tolerance mechanisms have previously been proposed for Grids and other distributed computing systems. However, some of them use only space redundancy (hardware replication), while others use only time redundancy (checkpointing and rollback). As a result, the existing mechanisms use Grid resources inefficiently. The main goal of ART, the mechanism presented here, is to reduce the number of replications by applying a checkpointing and rollback scheme to each replication. In ART, the minimum number of replications is selected adaptively for each task, based on an analysis of the probability of successful execution within the given deadline and on the task's reliability requirement. Our simulation results show that ART can significantly reduce the number of replications and improve scalability compared with existing mechanisms.
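The idea of choosing the smallest replication count that meets a reliability requirement can be illustrated with a minimal sketch. This is not the ART algorithm itself, whose analysis is not given in this record. It assumes, purely for illustration, that each replica independently finishes within the deadline with a known probability `p_success`, so that n replicas succeed with probability 1 - (1 - p_success)^n. The function name `min_replicas` and both parameters are hypothetical.

```python
def min_replicas(p_success: float, reliability: float) -> int:
    """Smallest n such that 1 - (1 - p_success)**n >= reliability.

    Illustrative assumptions (not taken from the paper):
    - p_success is the probability that ONE replica finishes within
      the task deadline;
    - replicas fail independently of each other.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")

    p_fail = 1.0 - p_success
    n = 1
    # Add replicas until the chance that at least one succeeds
    # meets the reliability requirement.
    while 1.0 - p_fail ** n < reliability:
        n += 1
    return n


# Hypothetical numbers: a higher per-replica success probability
# needs fewer replicas for the same reliability target.
print(min_replicas(0.5, 0.99))
print(min_replicas(0.875, 0.99))
```

Under these assumptions, anything that raises the per-replica success probability within the deadline lowers the required replication count, which is the kind of trade-off between time redundancy and space redundancy that the abstract describes.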

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Contributor: Arnaud Legrand

https://inria.hal.science/hal-00788867

Submitted on: Friday, February 15, 2013, 13:09:44

Last modified on: Thursday, April 4, 2024, 20:50:15

Dates and versions

hal-00788867, version 1 (15-02-2013)

Identifiers

HAL Id: hal-00788867, version 1
DOI: 10.1109/HPCS.2010.5547140

Cite

Sangho Yi, Derrick Kondo, Bongjae Kim, Geunyoung Park, Yookun Cho. Using replication and checkpointing for reliable task management in computational Grids. IEEE International Conference on High Performance Computing & Simulation (HPCS), 2010, Caen, France. pp.125-131, ⟨10.1109/HPCS.2010.5547140⟩. ⟨hal-00788867⟩


Collections

UGA CNRS INRIA LIG INRIA2 LIG_SIDCH

