Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores

Pierre Blanchard; Nicholas J Higham; Florent Lopez; Théo Mary; Srikara Pranesh

doi:10.1137/19M1289546

Journal article, SIAM Journal on Scientific Computing, Year: 2020


Pierre Blanchard

Role: Author
PersonId: 769919
IdRef: 201036479

Department of Mathematics [Manchester]

Nicholas J Higham

Role: Author

Department of Mathematics [Manchester]

Florent Lopez

Role: Author

Innovative Computing Laboratory [Knoxville]

Théo Mary

Role: Author
PersonId: 178018
IdHAL: tmary
ORCID: 0000-0001-9949-4634
IdRef: 230009417

Centre National de la Recherche Scientifique

Performance et Qualité des Algorithmes Numériques

Srikara Pranesh

Role: Author

Department of Mathematics [Manchester]

Abstract

Computing units that carry out a fused multiply-add (FMA) operation with matrix arguments, referred to as tensor units by some vendors, have great potential for use in scientific computing. However, these units are inherently mixed precision, and existing rounding error analyses do not support them. We consider a mixed precision block FMA that generalizes both the usual scalar FMA and existing tensor units. We describe how to exploit such a block FMA in the numerical linear algebra kernels of matrix multiplication and LU factorization and give detailed rounding error analyses of both kernels. An important application is to GMRES-based iterative refinement with block FMAs, for which our analysis provides new insight. Our framework is applicable to the tensor core units in the NVIDIA Volta and Turing GPUs. For these we compare matrix multiplication and LU factorization with TC16 and TC32 forms of FMA, which differ in the precision used for the output of the tensor cores. Our experiments on an NVIDIA V100 GPU confirm the predictions of the analysis that the TC32 variant is much more accurate than the TC16 one, and they show that the accuracy boost is obtained with almost no performance loss.
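
The difference between the two output precisions can be illustrated with a small simulation. The sketch below is not taken from the paper: it models only a single inner product accumulated in chunks of `b` terms, and it assumes half-precision (fp16) inputs, single-precision (fp32) arithmetic inside the unit, and a block size of 4. The helper names `fl16`, `fl32`, `block_fma_dot`, and `compare_tc16_tc32` are hypothetical. The only change between the two runs is whether the running sum is rounded to fp16 (a TC16-like output) or kept in fp32 (a TC32-like output) after each block.

```python
import math
import random
import struct


def fl16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fl32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def block_fma_dot(x, y, round_out, b=4):
    """Inner product built from block FMAs d = c + x_blk . y_blk.

    x and y hold fp16 values. Arithmetic inside a block is simulated in
    fp32 (an assumption of this sketch). round_out rounds the output d,
    which is then fed back as the input c of the next block.
    """
    c = 0.0
    for p in range(0, len(x), b):
        s = c  # c is fp16 or fp32, so it is representable in fp32
        for xi, yi in zip(x[p:p + b], y[p:p + b]):
            s = fl32(s + fl32(xi * yi))
        c = round_out(s)
    return c


def compare_tc16_tc32(n=1024, b=4, seed=0):
    """Return the relative errors (fp16 output, fp32 output) for one dot product."""
    rng = random.Random(seed)
    x = [fl16(rng.random()) for _ in range(n)]
    y = [fl16(rng.random()) for _ in range(n)]
    # Reference value: products of fp16 numbers are exact in float64,
    # and math.fsum returns the correctly rounded sum.
    exact = math.fsum(xi * yi for xi, yi in zip(x, y))
    err16 = abs(block_fma_dot(x, y, fl16, b) - exact) / abs(exact)
    err32 = abs(block_fma_dot(x, y, fl32, b) - exact) / abs(exact)
    return err16, err32


if __name__ == "__main__":
    err16, err32 = compare_tc16_tc32()
    print(f"fp16 output: relative error {err16:.2e}")
    print(f"fp32 output: relative error {err32:.2e}")
```

Because the inputs are rounded to fp16 before the reference value is computed, the two errors measure only the accumulation, not the initial conversion of the data. Under these assumptions the fp32-output run should show a much smaller error, in line with the behaviour the abstract reports for TC32 versus TC16.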

Keywords

NVIDIA GPU; matrix multiplication; rounding error analysis; floating-point arithmetic; fused multiply-add; tensor cores; LU factorization

Domains

Computer Science [cs]: Numerical Analysis [cs.NA]; Distributed, Parallel, and Cluster Computing [cs.DC]. Mathematics [math]: Numerical Analysis [math.NA].

Main file

BlockFMA.pdf (390.05 KB)

Origin: Files produced by the author(s)

Contributor: Theo Mary

https://hal.science/hal-02491076

Submitted on: Thursday, 28 May 2020, 11:24:27

Last modified on: Monday, 15 April 2024, 16:07:08

Dates and versions

hal-02491076, version 1 (25-02-2020)

hal-02491076, version 2 (28-05-2020)

Identifiers

HAL Id: hal-02491076, version 2
DOI: 10.1137/19M1289546

Cite

Pierre Blanchard, Nicholas J Higham, Florent Lopez, Théo Mary, Srikara Pranesh. Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores. SIAM Journal on Scientific Computing, 2020, 42 (3), pp.C124-C141. ⟨10.1137/19M1289546⟩. ⟨hal-02491076v2⟩

Collections

CNRS LIP6 TDS-MACS SORBONNE-UNIVERSITE SU-SCIENCES
