Conference paper, Year: 2022

Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach

Abstract

Integrating recent advancements in resilient algorithms and techniques into existing codes is a singular challenge in fault tolerance, in part due to the underlying complexity of implementing resilience in the first place, but also due to the difficulty introduced when integrating the functionality of a standalone new strategy with the preexisting resilience layers of an application. We propose that the answer is not to build integrated solutions for users, but runtimes designed to integrate into a larger comprehensive resilience system and thereby enable the necessary jump to multi-layered recovery. Our work designs, implements, and verifies one such comprehensive system of runtimes. Utilizing Fenix, a process resilience tool with integration into preexisting resilience systems as a design priority, we update Kokkos Resilience and the use pattern of VeloC to support application-level integration of resilience runtimes. Our work shows that designing integrable systems rather than integrated systems allows for user-designed optimization and upgrading of resilience techniques while maintaining the simplicity and performance of all-in-one resilience solutions. More application-specific choice in resilience strategies allows for better long-term flexibility, performance, and, importantly, simplicity.
Main file
Resilience_Integrations_Paper.pdf (1.16 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03772536, version 1 (08-09-2022)

Identifiers

  • HAL Id: hal-03772536, version 1

Cite

Matthew Whitlock, Nicolas Morales, George Bosilca, Aurélien Bouteiller, Bogdan Nicolae, et al. Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach. 2022 IEEE International Conference on Cluster Computing (CLUSTER 2022), Sep 2022, Heidelberg, Germany. ⟨hal-03772536⟩
32 views
73 downloads