Book chapter. Year: 2023

Ensuring Application Continuity with Fault Tolerance Techniques

Abstract

Clouds are an attractive environment for executing high-performance computing (HPC) applications. There is an extensive and consolidated history of long-running HPC applications that were deployed on clouds or successfully migrated from clusters to clouds, not only because clouds provide flexibility and access to virtually infinite resources, but also because they are offered to users as failure-free platforms. Outages are nonetheless not uncommon in clouds, so the cloud provider and/or the HPC applications need to implement fault tolerance mechanisms to ensure reliability and correct execution. In this chapter, we present an overview of the literature on the fault tolerance (FT) techniques most used by clouds and by the HPC applications that run on them, chiefly checkpoint-rollback and replication, as well as fault detection approaches and existing reliable storage in clouds.

File not deposited

Dates and versions

hal-04388577, version 1 (11-01-2024)

Cite

Rafaela Brum, Luan Teylo, Luciana Arantes, Pierre Sens. Ensuring Application Continuity with Fault Tolerance Techniques. High Performance Computing in Clouds: Moving HPC Applications to a Scalable and Cost-Effective Environment, Springer International Publishing, pp. 191-212, 2023, 978-3-031-29769-4. ⟨10.1007/978-3-031-29769-4_10⟩. ⟨hal-04388577⟩