Conference papers

From tasks graphs to asynchronous distributed checkpointing with local restart

Romain Lion 1 Samuel Thibault 2, 1
1 STORM - STatic Optimizations, Runtime Methods
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
Abstract: The ever-increasing number of computation units assembled in current HPC platforms leads to a concerning increase in fault probability. Traditional checkpoint/restart strategies avoid wasting large amounts of computation time when such a fault occurs. With the increasing amount of data handled by current applications, however, these strategies suffer from unreasonable data transfer demands or from the global synchronizations they entail. Meanwhile, the current trend towards task-based programming is an opportunity to revisit the principles of checkpoint/restart strategies. We propose here a checkpointing scheme that is closely tied to the execution of task graphs. We describe how it allows for completely asynchronous and distributed checkpointing, as well as localized node restart, thus opening the way to very large scalability. We also show how a synergy between the application data transfers and the checkpointing transfers can keep the additional network load reasonable, measured at less than 10% on a dense linear algebra example.

https://hal.archives-ouvertes.fr/hal-02970529
Contributor: Romain Lion
Submitted on: Monday, October 19, 2020 - 12:02:28 PM
Last modification on: Saturday, February 6, 2021 - 4:37:28 PM

File

2020001221.pdf
Files produced by the author(s)

Citation

Romain Lion, Samuel Thibault. From tasks graphs to asynchronous distributed checkpointing with local restart. FTXS 2020 - IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale, Nov 2020, Atlanta / Virtual, United States. ⟨10.1109/FTXS51974.2020.00009⟩. ⟨hal-02970529v2⟩
