Conference paper, Year: 2021

Towards Efficient I/O Scheduling for Collaborative Multi-Level Checkpointing

Avinash Maurya
Bogdan Nicolae
M Mustafa Rafique
Thierry Tonellot
  • Role: Author
  • PersonId: 1110240
Franck Cappello
  • Role: Author
  • PersonId: 1102088

Abstract

Efficiently checkpointing distributed data structures periodically at key moments during runtime is a recurring, fundamental pattern in many use cases: fault tolerance based on checkpoint-restart, in-situ or post-analytics, reproducibility, adjoint computations, etc. In this context, multi-level checkpointing is a popular technique: distributed processes write their shard of the data independently to fast local storage tiers, then flush it asynchronously to a shared, slower tier of higher capacity. However, given the limited capacity of fast tiers (e.g., GPU memory) and increasing checkpoint frequency, processes often run out of space and must fall back to blocking writes to the slow tiers. To mitigate this problem, compression is often applied to reduce checkpoint sizes. Unfortunately, this reduction is not uniform: some processes have spare capacity left on the fast tiers, while others still run out of space. In this paper, we study how to leverage this imbalance to reduce I/O overheads for multi-level checkpointing. To this end, we solve an optimization problem: how much data to send from each process that runs out of space to the processes that have spare capacity, so as to minimize the time spent blocking on I/O. We propose two algorithms: one based on a greedy approach and the other based on modified minimum cost flows. We evaluate our proposal using synthetic and real-life application traces. Our evaluation shows that both algorithms achieve significant improvements in checkpoint performance over traditional multi-level checkpointing.
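
To make the optimization problem in the abstract concrete, here is a minimal Python sketch of one possible greedy heuristic. It is not the authors' algorithm, whose details the abstract does not give. It assumes every process has the same fast-tier capacity and ignores transfer costs, bandwidth, and flush timing. The function name `greedy_offload` and all parameters are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def greedy_offload(
    sizes: List[int], capacity: int
) -> Tuple[List[Tuple[int, int, int]], Dict[int, int]]:
    """Illustrative greedy pairing (an assumption, not the paper's method).

    sizes[p] is the compressed checkpoint size of process p, and capacity
    is the fast-tier capacity assumed identical for every process.
    Returns (transfers, leftover): transfers holds (src, dst, amount)
    tuples, and leftover maps src to the amount that still has to be
    written with blocking I/O to the slow tier.
    """
    # Processes whose checkpoint does not fit on their own fast tier.
    overflow = {p: s - capacity for p, s in enumerate(sizes) if s > capacity}
    # Processes with unused fast-tier space after storing their own shard.
    spare = {p: capacity - s for p, s in enumerate(sizes) if s < capacity}

    transfers: List[Tuple[int, int, int]] = []
    leftover: Dict[int, int] = {}

    # Serve the largest overflow first, filling the largest spare slots first.
    for src in sorted(overflow, key=overflow.get, reverse=True):
        need = overflow[src]
        for dst in sorted(spare, key=spare.get, reverse=True):
            if need == 0:
                break
            amount = min(need, spare[dst])
            if amount > 0:
                transfers.append((src, dst, amount))
                spare[dst] -= amount
                need -= amount
        if need > 0:
            # No peer can absorb this part; it falls back to blocking writes.
            leftover[src] = need
    return transfers, leftover


if __name__ == "__main__":
    moves, blocked = greedy_offload([12, 4, 9, 15], capacity=10)
    print(moves, blocked)
```

In this sketch, any amount that ends up in `leftover` corresponds to data that would still be written synchronously to the slow tier. A cost-aware formulation, such as the minimum cost flow approach the abstract mentions, would additionally weigh which peer receives each piece.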
Main file
SimG__MASCOTS_2021.pdf (1.48 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03344362, version 1 (15-09-2021)

Identifiers

  • HAL Id: hal-03344362, version 1

Cite

Avinash Maurya, Bogdan Nicolae, M Mustafa Rafique, Thierry Tonellot, Franck Cappello. Towards Efficient I/O Scheduling for Collaborative Multi-Level Checkpointing. MASCOTS'21: 29th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, Nov 2021, Virtual, Portugal. ⟨hal-03344362⟩
