Simultaneous Branch and Warp Interweaving for Sustained GPU Performance

Nicolas Brunie; Caroline Collange; Gregory Diamos

doi:10.1109/ISCA.2012.6237005

Conference paper, Year: 2012


Nicolas Brunie

Role: Corresponding author
PersonId : 915438


Laboratoire de l'Informatique du Parallélisme

Arithmetic and Computing

ARENAIRE - Arithmétique des ordinateurs

Caroline Collange

Role: Corresponding author
PersonId : 177452
IdHAL : caroline-collange
IdRef : 151116776


Arithmetic and Computing

Gregory Diamos

Role: Author
PersonId : 1122248

NVIDIA

Abstract

Single-Instruction Multiple-Thread (SIMT) micro-architectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize the cost of instruction fetch, decode and control logic over multiple execution units. As individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of SIMD execution. We present two complementary techniques that mitigate the impact of thread divergence on SIMT micro-architectures. Both techniques relax the SIMD execution model by allowing two distinct instructions, instead of a single one, to be scheduled to disjoint subsets of the same row of execution units. They increase flexibility by providing more thread grouping opportunities than SIMD, while preserving the affinity between threads to avoid introducing extra memory divergence. We consider (1) co-issuing instructions from different divergent paths of the same warp and (2) co-issuing instructions from different warps. To support (1), we introduce a novel thread reconvergence technique that ensures threads resume lockstep execution at control-flow reconvergence points without hindering their ability to run branches in parallel. We propose a lane shuffling technique to allow solution (2) to benefit from inter-warp correlations in divergence patterns. The combination of all these techniques improves performance by 23% on a set of regular GPGPU applications and by 40% on irregular applications, while maintaining the same instruction-fetch and processing-unit resource requirements as the contemporary Fermi GPU architecture.
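
The issue-slot arithmetic behind the two techniques can be sketched with a toy model. This is an illustration only, under strong simplifying assumptions: one instruction per issue slot per path, no memory stalls, and no reconvergence cost. The function names, the 8-lane warp, and the rotation-based lane shuffle are hypothetical choices made for this sketch; they are not the paper's mechanisms or its simulator.

```python
# Toy issue-slot model of thread divergence. All names are hypothetical.
# A mask is a list of booleans, one per lane: True = thread takes the branch.


def serialized_slots(mask, len_taken, len_not_taken):
    """Baseline SIMT: the two divergent paths of a warp issue one after the other."""
    taken = any(mask)            # at least one thread takes the branch
    not_taken = not all(mask)    # at least one thread does not
    return (len_taken if taken else 0) + (len_not_taken if not_taken else 0)


def co_issued_slots(mask, len_taken, len_not_taken):
    """Assumed model of co-issuing both paths of one warp on disjoint lanes:
    each slot advances both paths, so the longer path sets the slot count."""
    taken = any(mask)
    not_taken = not all(mask)
    return max(len_taken if taken else 0, len_not_taken if not_taken else 0)


def can_co_issue(mask_a, mask_b):
    """Two warps can share one row of execution units only if no lane is needed by both."""
    return not any(a and b for a, b in zip(mask_a, mask_b))


def rotate(mask, shift):
    """Hypothetical lane shuffle: rotate the thread-to-lane mapping to the right."""
    shift %= len(mask)
    return mask[-shift:] + mask[:-shift]


if __name__ == "__main__":
    divergent = [True] * 3 + [False] * 5
    print(serialized_slots(divergent, 4, 6))   # both paths, back to back
    print(co_issued_slots(divergent, 4, 6))    # both paths, side by side

    warp_a = [True, True, True, False, False, False, False, False]
    warp_b = [True, True, False, False, False, False, False, False]
    print(can_co_issue(warp_a, warp_b))              # same low lanes active: conflict
    print(can_co_issue(warp_a, rotate(warp_b, 4)))   # shifted mapping: disjoint
```

In this model a warp with no divergence costs the same under both schemes, which matches the intuition that the gain comes only from lanes left idle by divergence. The second half shows why a per-warp lane mapping matters when two warps tend to keep the same lanes active: their masks collide unless one of them is mapped differently.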

Subject areas

Hardware Architecture [cs.AR]

Main file

Brunie_SimultaneousBranchWarpInterweaving_ISCA12.pdf (536.13 KB)

Origin: Files produced by the author(s)

Contributor: Caroline Collange

https://ens-lyon.hal.science/ensl-00649650

Submitted on: Wednesday, May 2, 2012, 20:58:39

Last modified on: Thursday, February 15, 2024, 03:32:08

Long-term archiving on: Thursday, December 15, 2016, 03:56:07

Dates and versions

ensl-00649650 , version 2 (02-05-2012)

Identifiers

HAL Id : ensl-00649650 , version 2
DOI : 10.1109/ISCA.2012.6237005

Cite

Nicolas Brunie, Caroline Collange, Gregory Diamos. Simultaneous Branch and Warp Interweaving for Sustained GPU Performance. 39th Annual International Symposium on Computer Architecture (ISCA), Jun 2012, Portland, OR, United States. pp. 49-60, ⟨10.1109/ISCA.2012.6237005⟩. ⟨ensl-00649650⟩


Collections

ENS-LYON UNIV-RENNES1 CNRS INRIA UNIV-LYON1 IRISA INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UDL UR1-MATH-NUM

778 views

1734 downloads
