GPUburn: A System to Test and Mitigate GPU Hardware Failures

Eric Petit 1 David Defour 2
2 DALI - Digits, Architectures et Logiciels Informatiques
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, UPVD - Université de Perpignan Via Domitia
Abstract: Due to factors such as high transistor density, high frequency, and low voltage, today's processors are more than ever subject to hardware failures. These errors have various impacts depending on the location of the error and the type of processor. Because of the hierarchical structure of the compute units and of work scheduling, hardware failures on GPUs affect only part of the application. In this paper we present a new methodology to characterize the hardware failures of Nvidia GPUs, based on a software micro-benchmarking platform implemented in OpenCL. We also identify which hardware parts of the TESLA architecture are most sensitive to intermittent errors, which usually appear as the processor ages. We obtained these results by accelerating the aging process, running the processors at high temperature. We show that on GPUs the impact of intermittent errors is limited to a localized architecture tile. Finally, we propose a methodology to detect defective units and record their location so that they can be avoided, ensuring program correctness on such architectures and improving GPU fault tolerance and lifespan.
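The detect, record, and avoid idea in the abstract can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not the authors' OpenCL platform: `SimulatedUnit`, `reference_kernel`, `find_defective_units`, and `schedule` are hypothetical names, the fault model (a flipped low bit with some probability) is invented for illustration, and real hardware would require mapping work to physical compute units, which this sketch does not attempt.

```python
import random


def reference_kernel(x):
    """Deterministic workload whose correct result can be recomputed on the host."""
    return sum((x * i) % 97 for i in range(1, 65))


class SimulatedUnit:
    """Hypothetical stand-in for one compute unit (not a real GPU API)."""

    def __init__(self, unit_id, fault_rate=0.0, rng=None):
        self.unit_id = unit_id
        self.fault_rate = fault_rate          # assumed probability of a corrupted result
        self.rng = rng or random.Random(0)

    def run(self, x):
        result = reference_kernel(x)
        # Intermittent fault model: corrupt the result only some of the time.
        if self.rng.random() < self.fault_rate:
            result ^= 1                        # flip the lowest bit
        return result


def find_defective_units(units, trials=200):
    """Run the test workload repeatedly and record the id of every unit that mismatches."""
    defective = set()
    for unit in units:
        for x in range(trials):
            if unit.run(x) != reference_kernel(x):
                defective.add(unit.unit_id)
                break                          # one mismatch is enough to flag the unit
    return defective


def schedule(work_items, units, defective):
    """Assign work round-robin to the units that were not flagged as defective."""
    healthy = [u for u in units if u.unit_id not in defective]
    if not healthy:
        raise RuntimeError("no healthy unit available")
    return {item: healthy[i % len(healthy)].unit_id
            for i, item in enumerate(work_items)}
```

Because the simulated faults are intermittent, a single test run may miss a defective unit; repeating the workload (`trials`) raises the chance of catching it, at the cost of longer test time.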
Document type:
Conference paper
SAMOS: Embedded Computer Systems: Architectures, Modeling, and Simulation, Jul 2013, Samos, Greece. 13th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, pp.263-270, 2013, <10.1109/SAMOS.2013.6621133>

https://hal.archives-ouvertes.fr/hal-00827588
Contributor: David Defour
Submitted on: Wednesday, 29 May 2013 - 13:59:28
Last modified on: Friday, 9 June 2017 - 10:41:47
Document(s) archived on: Tuesday, 4 April 2017 - 12:53:39

File

GPUburn_SAMOS.pdf
Files produced by the author(s)

License


Distributed under a Creative Commons Attribution 4.0 International License

Citation

Eric Petit, David Defour. GPUburn: A System to Test and Mitigate GPU Hardware Failures. SAMOS: Embedded Computer Systems: Architectures, Modeling, and Simulation, Jul 2013, Samos, Greece. 13th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, pp.263-270, 2013, <10.1109/SAMOS.2013.6621133>. <hal-00827588>
