C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, vol.5, issue.3, pp.308-323, 1979.
DOI : 10.1145/355841.355847

J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, An extended set of FORTRAN basic linear algebra subprograms, ACM Transactions on Mathematical Software, vol.14, issue.1, pp.1-17, 1988.
DOI : 10.1145/42288.42291

J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff, A set of level 3 basic linear algebra subprograms, ACM Transactions on Mathematical Software, vol.16, issue.1, pp.1-17, 1990.
DOI : 10.1145/77626.79170

R. C. Whaley and J. J. Dongarra, Automatically Tuned Linear Algebra Software, Proceedings of the IEEE/ACM SC98 Conference, pp.1-27, 1998.
DOI : 10.1109/SC.1998.10004

K. Goto and R. A. van de Geijn, Anatomy of high-performance matrix multiplication, ACM Transactions on Mathematical Software, vol.34, issue.3, 2008.
DOI : 10.1145/1356052.1356053

K. Goto and R. A. van de Geijn, High-performance implementation of the level-3 BLAS, ACM Transactions on Mathematical Software, vol.35, issue.1, 2008.
DOI : 10.1145/1377603.1377607

D. Fabregat-Traver, Y. Aulchenko, and P. Bientinesi, Solving sequences of generalized least-squares problems on multi-threaded architectures, Applied Mathematics and Computation, vol.234, pp.606-617, 2014.
DOI : 10.1016/j.amc.2014.02.056

D. Fabregat-Traver and P. Bientinesi, Computing Petaflops over Terabytes of Data, ACM Transactions on Mathematical Software, vol.40, issue.4, 2014.
DOI : 10.1145/2560421

K. Bergman, Exascale computing study: Technology challenges in achieving exascale systems, DARPA Report, 2008.

N. Whitehead and A. Fit-florea, Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs, 2011.

M. Corden, Differences in floating-point arithmetic between Intel® Xeon® processors and the Intel® Xeon Phi™ coprocessor, 2013.

K. Doertel, Best known method: Avoid heterogeneous precision in control flow calculations, 2013.

U. Kulisch and V. Snyder, The exact dot product as basic tool for long interval arithmetic, Computing, vol.91, issue.3, pp.307-313, 2011.
DOI : 10.1007/s00607-010-0127-7

S. Collange, D. Defour, S. Graillat, and R. Iakymchuk, Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures, 2014.

N. J. Higham, Accuracy and stability of numerical algorithms, 2nd ed., Society for Industrial and Applied Mathematics (SIAM), 2002.

X. S. Li, J. W. Demmel, D. H. Bailey, G. Henry, Y. Hida et al., Design, implementation and testing of extended and mixed precision BLAS, ACM Transactions on Mathematical Software, vol.28, issue.2, pp.152-205, 2002.
DOI : 10.1145/567806.567808

Y. Hida, X. S. Li, and D. H. Bailey, Algorithms for quad-double precision floating point arithmetic, Proceedings of the 15th IEEE Symposium on Computer Arithmetic (ARITH-15), pp.155-162, 2001.
DOI : 10.1109/ARITH.2001.930115

D. E. Knuth, The Art of Computer Programming, vol.2: Seminumerical Algorithms, 1997.

K. Matsumoto, N. Nakasato, T. Sakai, H. Yahagi, and S. G. Sedukhin, Multi-level Optimization of Matrix Multiplication for GPU-equipped Systems, Procedia Computer Science, vol.4, pp.342-351, 2011.
DOI : 10.1016/j.procs.2011.04.036

J. Demmel and H. D. Nguyen, Fast Reproducible Floating-Point Summation, 2013 IEEE 21st Symposium on Computer Arithmetic, pp.163-172, 2013.
DOI : 10.1109/ARITH.2013.9

J. Demmel and H. D. Nguyen, Numerical Reproducibility and Accuracy at ExaScale (invited talk), Proceedings of the 21st IEEE Symposium on Computer Arithmetic, pp.235-237, 2013.