K. Asanović, Vector microprocessors, 1998.

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer et al., Rodinia: A benchmark suite for heterogeneous computing, 2009 IEEE International Symposium on Workload Characterization (IISWC), pp.44-54, 2009.
DOI : 10.1109/IISWC.2009.5306797

S. Collange, D. Defour, and Y. Zhang, Dynamic detection of uniform and affine vectors in GPGPU computations, Euro-Par 3rd Workshop on Highly Parallel Processing on a Chip, 2009.

G. Diamos, The design and implementation of Ocelot's dynamic binary translator from PTX to multi-core x86, 2009.

G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, Ocelot, Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT '10, 2010.
DOI : 10.1145/1854273.1854318

J. Gummaraju and M. Rosenblum, Stream Programming on General-Purpose Processors, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), pp.343-354, 2005.
DOI : 10.1109/MICRO.2005.32

Intel, Intel 64 and IA-32 Architectures Software Developer's Manuals, Architecture, vol.1, 2010.

A. Kerr, G. Diamos, and S. Yalamanchili, A characterization and analysis of GPGPU kernels, 2009.

C. Lattner and V. Adve, LLVM: A compilation framework for lifelong program analysis & transformation, International Symposium on Code Generation and Optimization, 2004. CGO 2004., p.75, 2004.
DOI : 10.1109/CGO.2004.1281665

J. E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, vol.28, issue.2, pp.39-55, 2008.
DOI : 10.1109/MM.2008.31

A. Munshi, The OpenCL specification. Khronos OpenCL Working Group, 2009.

S. Oberman, G. Favor, and F. Weber, AMD 3DNow! technology: architecture and implementations, IEEE Micro, vol.19, issue.2, pp.37-48, 1999.
DOI : 10.1109/40.755466

O. Rosenberg, Optimizing OpenCL on CPUs, OpenCL BOF in SIGGRAPH, 2010.

J. A. Stratton, V. Grover, J. Marathe, B. Aarts, M. Murphy et al., Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs, Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, CGO '10, pp.111-119, 2010.
DOI : 10.1145/1772954.1772971

J. A. Stratton, S. S. Stone, and W.-M. W. Hwu, MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs, Languages and Compilers for Parallel Computing: 21st International Workshop, pp.16-30, 2008.
DOI : 10.1007/978-3-540-89740-8_2

V. Volkov, Programming inverse memory hierarchy: case of stencils on GPUs, GPU Workshop for Scientific Computing, International Conference on Parallel Computational Fluid Dynamics (ParCFD), 2010.

V. Volkov and J. W. Demmel, Benchmarking GPUs to tune dense linear algebra, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2008.
DOI : 10.1109/SC.2008.5214359