Conference paper. Year: 2017

Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale

Abstract

This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors. Although other detection techniques exist for HPC applications, based on algorithm-based fault tolerance (ABFT), invariant preservation, or data analytics, replication remains the most transparent and least intrusive technique. We explore the right level of replication (duplication, triplication, or more) needed to efficiently detect and correct silent errors. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. If not, one or more silent errors have been detected, and the application rolls back to the last checkpoint. We provide a detailed analytical study of both scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report extensive simulation results that corroborate the analytical model.
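
The compare-then-checkpoint rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `checkpoint_decision`, the representation of each replica's result as a hashable value (for example a digest string), and the strict-majority rule for more than three replicas are all assumptions made for this sketch. For two and three replicas, strict majority reduces to the rule stated in the abstract (both results agree, or two out of three agree).

```python
from collections import Counter


def checkpoint_decision(results):
    """Decide what to do at the end of one checkpoint interval.

    `results` holds one hashable value per replica (hypothetical
    representation, e.g. a digest of each replica's output).
    Returns ("checkpoint", value) when enough replicas agree,
    otherwise ("rollback", None).
    """
    n = len(results)
    if n < 2:
        raise ValueError("at least two replicas are needed to detect a silent error")

    # Most frequent result and how many replicas produced it.
    value, votes = Counter(results).most_common(1)[0]

    # Strict majority: 2 of 2 for duplication, 2 of 3 for triplication.
    # Applying the same rule to n > 3 is an assumption of this sketch.
    if 2 * votes > n:
        return ("checkpoint", value)

    # Disagreement: a silent error was detected and cannot be corrected
    # by voting, so roll back to the last checkpoint.
    return ("rollback", None)
```

With duplication, any mismatch forces a rollback, since there is no way to tell which replica is wrong; with triplication, a single corrupted replica is outvoted and the checkpoint can still be taken.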
Main file
ftxs.pdf (489.27 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02082907, version 1 (28-03-2019)

Cite

Anne Benoit, Aurélien Cavelan, Franck Cappello, Padma Raghavan, Yves Robert, et al. Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale. 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), Jun 2017, Washington DC, United States. pp.31-38, ⟨10.1145/3086157.3086162⟩. ⟨hal-02082907⟩
45 views
118 downloads