Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, vol.5, issue.3, pp.308-323, 1979.
DOI : 10.1145/355841.355847
An extended set of FORTRAN basic linear algebra subprograms, ACM Transactions on Mathematical Software, vol.14, issue.1, pp.1-17, 1988.
DOI : 10.1145/42288.42291
A set of level 3 basic linear algebra subprograms, ACM Transactions on Mathematical Software, vol.16, issue.1, pp.1-17, 1990.
DOI : 10.1145/77626.79170
Automatically Tuned Linear Algebra Software, Proceedings of the IEEE/ACM SC98 Conference, pp.1-27, 1998.
DOI : 10.1109/SC.1998.10004
Anatomy of high-performance matrix multiplication, ACM Transactions on Mathematical Software, vol.34, issue.3, 2008.
DOI : 10.1145/1356052.1356053
High-performance implementation of the level-3 BLAS, ACM Transactions on Mathematical Software, vol.35, issue.1, 2008.
DOI : 10.1145/1377603.1377607
Solving sequences of generalized least-squares problems on multi-threaded architectures, Applied Mathematics and Computation, vol.234, pp.606-617, 2014.
DOI : 10.1016/j.amc.2014.02.056
Computing Petaflops over Terabytes of Data, ACM Transactions on Mathematical Software, vol.40, issue.4, 2014.
DOI : 10.1145/2560421
Exascale computing study: Technology challenges in achieving exascale systems, DARPA Report, 2008.
Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs, 2011.
Differences in floating-point arithmetic between Intel® Xeon® processors and the Intel® Xeon Phi™ coprocessor, 2013.
Best known method: Avoid heterogeneous precision in control flow calculations, 2013.
The exact dot product as basic tool for long interval arithmetic, Computing, vol.205, issue.3, pp.307-313, 2011.
DOI : 10.1007/s00607-010-0127-7
Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures, 2014.
Accuracy and stability of numerical algorithms, second ed., Society for Industrial and Applied Mathematics (SIAM), 2002.
Design, implementation and testing of extended and mixed precision BLAS, ACM Transactions on Mathematical Software, vol.28, issue.2, pp.152-205, 2002.
DOI : 10.1145/567806.567808
Algorithms for quad-double precision floating point arithmetic, Proceedings 15th IEEE Symposium on Computer Arithmetic. ARITH-15 2001, pp.155-162, 2001.
DOI : 10.1109/ARITH.2001.930115
The Art of Computer Programming, Seminumerical Algorithms, vol.2, 1997.
Multi-level Optimization of Matrix Multiplication for GPU-equipped Systems, Procedia Computer Science, vol.4, pp.342-351, 2011.
DOI : 10.1016/j.procs.2011.04.036
Fast Reproducible Floating-Point Summation, 2013 IEEE 21st Symposium on Computer Arithmetic, pp.163-172, 2013.
DOI : 10.1109/ARITH.2013.9
Numerical Reproducibility and Accuracy at ExaScale (invited talk), Proceedings of the 21st IEEE Symposium on Computer Arithmetic, pp.235-237, 2013.