S. Tomov, R. Nath, P. Du, and J. Dongarra, Magma, matrix algebra on gpu and multicore architectures

G. Guennebaud and B. Jacob, Eigen v3, 2016.

A. R. Terán, L. Lacassagne, A. H. Zahraee, and M. Gouiffes, Real-time covariance tracking algorithm for embedded systems, Design and Architectures for Signal and Image Processing (DASIP), 2013 Conference on, pp.104-111, 2013.

R. Frühwirth, Application of Kalman filtering to track and vertex fitting Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, pp.444-450, 1987.

D. Beymer, P. Mclauchlan, B. Coifman, and J. Malik, A real-time computer vision system for measuring traffic parameters, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.495-501, 1997.
DOI : 10.1109/CVPR.1997.609371

J. Shin, M. W. Hall, J. Chame, C. Chen, and P. D. Hovland, Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology, Software Automatic Tuning, pp.353-370, 2011.
DOI : 10.1007/978-1-4419-6935-4_20

X. Tian, H. Saito, S. V. Preis, E. N. Garcia, S. S. Kozhukhov et al., Effective SIMD Vectorization for Intel Xeon Phi Coprocessors, Scientific Programming, pp.1-14, 2015.
DOI : 10.1007/978-3-642-30961-8_5

I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin et al., High-Performance Matrix-Matrix Multiplications of Very Small Matrices, European Conference on Parallel Processing, pp.659-671, 2016.
DOI : 10.1109/ICPPW.2012.39
URL : https://hal.archives-ouvertes.fr/hal-01409286

T. Dong, A. Haidar, S. Tomov, and J. Dongarra, A Fast Batched Cholesky Factorization on a GPU, 2014 43rd International Conference on Parallel Processing, pp.432-440, 2014.
DOI : 10.1109/ICPP.2014.52
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.637.5351

N. J. Higham, Accuracy and stability of numerical algorithms, SIAM, 2002.
DOI : 10.1137/1.9780898718027

N. J. Higham, Cholesky factorization, Wiley Interdisciplinary Reviews: Computational Statistics, vol.103, issue.2, pp.251-254, 2009.
DOI : 10.1137/1.9781611971484

T. Dong, A. Haidar, P. Luszczek, J. A. Harris, S. Tomov et al., LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU, 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), pp.157-160, 2014.
DOI : 10.1109/HPCC.2014.30

L. Lacassagne, D. Etiemble, A. Hassan-zahraee, A. Dominguez, and P. Vezolle, High level transforms for SIMD and low-level computer vision algorithms, Proceedings of the 2014 Workshop on Workshop on programming models for SIMD/Vector processing, WPMVP '14, pp.49-56, 2014.
DOI : 10.1145/2568058.2568067
URL : https://hal.archives-ouvertes.fr/hal-01094906

I. Masliah, M. Baboulin, and J. Falcou, Metaprogramming Dense Linear Algebra Solvers Applications to Multi and Many-Core Architectures, 2015 IEEE Trustcom/BigDataSE/ISPA, pp.69-76, 2015.
DOI : 10.1109/Trustcom.2015.614
URL : https://hal.archives-ouvertes.fr/hal-01221358

J. Abel, K. Balasubramanian, M. Bargeron, T. Craver, and M. Phlipot, Applications tuning for streaming SIMD extensions, Intel Technology Journal, vol.2, 1999.

J. Iliffe, The use of the genie system in numerical calculation, Annual Review in Automatic Programming, vol.2, pp.1-28, 1961.

A. Fog, Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, pp.2016-2017, 2016.

P. Soderquist, M. Leeser-]-c, and . Lomont, Area and performance tradeoffs in floating-point divide and square-root implementations, ACM Computing Surveys, vol.28, issue.3, pp.518-564, 1996.
DOI : 10.1145/243439.243481

V. Y. Pan, METHODS OF COMPUTING VALUES OF POLYNOMIALS, Russian Mathematical Surveys, vol.21, issue.1, pp.105-136, 1966.
DOI : 10.1070/RM1966v021n01ABEH004147