Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms

Thomas Hérault; Yves Robert; Aurélien Bouteiller; Dorian Arnold; Kurt B Ferreira; George Bosilca; Jack Dongarra

Report (Research Report) Year: 2017

Thomas Hérault

Role: Author

Innovative Computing Laboratory [Knoxville]

Yves Robert

Role: Author
PersonId : 739318
IdHAL : yves-robert
ORCID : 0000-0003-2361-055X
IdRef : 029813611

Laboratoire de l'Informatique du Parallélisme

Innovative Computing Laboratory [Knoxville]

Optimisation des ressources : modèles, algorithmes et ordonnancement

Aurélien Bouteiller

Role: Author
PersonId : 863938

Innovative Computing Laboratory [Knoxville]

Dorian Arnold

Role: Author

Emory University [Atlanta, GA]

Kurt B Ferreira

Role: Author

Sandia National Laboratories [Albuquerque]

George Bosilca

Role: Author
PersonId : 863939

Innovative Computing Laboratory [Knoxville]

Jack Dongarra

Role: Author
PersonId : 863940

University of Manchester [Manchester]

Innovative Computing Laboratory [Knoxville]

Abstract

In high-performance computing environments, input/output (I/O) operations from various sources often contend for scarce available bandwidth. In addition to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden, as it increases I/O contention and leads to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications which share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval as defined by Young/Daly, while providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement in overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.

This report focuses on shared scientific computing platforms, i.e., platforms on which several classes of applications execute concurrently. These applications compete for access to I/O resources, both for their regular operations and for taking their checkpoints. We propose a model and analyze several checkpointing strategies, with fixed or application-dependent periods, with or without interference, blocking or non-blocking. We determine a lower bound on the fraction of time necessarily lost by the platform under any checkpoint/restart strategy, and we show experimentally that our cooperative strategy achieves performance very close to this bound. In our cooperative strategy, the checkpointing periods of the applications are not necessarily those given by the Young/Daly formula, because the available bandwidth does not always allow them to be enforced, and some applications necessarily have a longer (and therefore suboptimal) period. We report the results of a set of simulations conducted with application and platform parameter sets that correspond to current and prospective scenarios.
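To make the reference to the Young/Daly period concrete, the sketch below computes the classical first-order period sqrt(2 * C * mu) for a single application and then estimates how busy a shared I/O system would be if every application checkpointed at that period, one at a time. It is only an illustration and not the model of the report: the function names, the per-application parameters (checkpoint cost C and mean time between failures mu, both in seconds), and the simple utilization test are all assumptions, and downtime and recovery costs are ignored.

```python
import math


def young_daly_period(checkpoint_cost, mtbf):
    """First-order Young/Daly checkpointing period: sqrt(2 * C * mu)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period, checkpoint_cost, mtbf):
    """Approximate fraction of time lost by one application.

    C / P is the checkpoint overhead; P / (2 * mu) is the expected
    re-execution after a failure. Downtime and recovery are ignored
    (simplifying assumption).
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


def io_utilization(apps):
    """Fraction of time the shared I/O system would be busy if every
    application checkpointed at its own Young/Daly period and checkpoints
    were serialized (hypothetical, simplified contention test).

    `apps` is a list of (checkpoint_cost, mtbf) pairs in seconds.
    """
    return sum(c / young_daly_period(c, mu) for c, mu in apps)


# Hypothetical workload: identical applications with C = 50 s, mu = 10000 s.
small_platform = [(50.0, 10000.0)] * 12
large_platform = [(50.0, 10000.0)] * 25

period = young_daly_period(50.0, 10000.0)
print("Young/Daly period (s):", period)
print("first-order waste:", first_order_waste(period, 50.0, 10000.0))
print("I/O utilization, 12 apps:", io_utilization(small_platform))
print("I/O utilization, 25 apps:", io_utilization(large_platform))
```

Under these assumptions, a utilization above 1 means the I/O system cannot serve every checkpoint at the Young/Daly period, so at least some applications would have to checkpoint less often, which is the situation the abstract describes.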

Keywords

scheduling policy shared platform cooperative checkpoint I/O contention

resilience checkpoint optimization I/O scheduling strategy

Domains

Computer Science [cs]

Main file

rr9109.pdf (900.06 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-01621295

Submitted on: Monday, October 23, 2017, 10:57:20

Last modified on: Tuesday, October 31, 2023, 11:26:05

Long-term archiving on: Wednesday, January 24, 2018, 13:12:09

Dates and versions

hal-01621295 , version 1 (23-10-2017)

Identifiers

HAL Id : hal-01621295 , version 1

Cite

Thomas Hérault, Yves Robert, Aurélien Bouteiller, Dorian Arnold, Kurt B Ferreira, et al. Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms. [Research Report] RR-9109, INRIA. 2017, pp.1-20. ⟨hal-01621295⟩

Collections

ENS-LYON CNRS INRIA UNIV-LYON1 INRIA-RRRT INRIA2 LARA UDL
