S. Kalathingal, S. Collange, B. N. Swamy, and A. Seznec, Dynamic interthread vectorization architecture: extracting DLP from TLP, International Symposium on Computer Architecture and High-Performance Computing (SBAC-PAD)
DOI : 10.1109/sbac-pad.2016.11

URL : https://hal.archives-ouvertes.fr/hal-01356202

D. M. Tullsen, S. J. Eggers, and H. M. Levy, Simultaneous multithreading: Maximizing on-chip parallelism, Proceedings of the 22nd Annual International Symposium on Computer Architecture, ISCA '95, pp.392-403, 1995.

D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo et al., Exploiting choice, ACM SIGARCH Computer Architecture News, vol.24, issue.2, pp.191-202, 1996.
DOI : 10.1145/232974.232993

C. Bienia, S. Kumar, J. P. Singh, and K. Li, The PARSEC benchmark suite, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT '08, pp.72-81, 2008.
DOI : 10.1145/1454115.1454128

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer et al., Rodinia: A benchmark suite for heterogeneous computing, 2009 IEEE International Symposium on Workload Characterization (IISWC), pp.44-54, 2009.
DOI : 10.1109/IISWC.2009.5306797

A. Seznec, S. Felix, V. Krishnan, and Y. Sazeides, Design tradeoffs for the Alpha EV8 conditional branch predictor, 29th International Symposium on Computer Architecture, pp.25-29, 2002.

S. Hily and A. Seznec, Branch prediction and simultaneous multithreading, Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques, pp.169-173, 1996.
DOI : 10.1109/PACT.1996.552664

URL : https://hal.archives-ouvertes.fr/inria-00073847

S. Hily and A. Seznec, Out-of-order execution may not be cost-effective on processors featuring simultaneous multithreading, Proceedings Fifth International Symposium on High-Performance Computer Architecture, pp.64-67, 1999.
DOI : 10.1109/HPCA.1999.744331

URL : https://hal.archives-ouvertes.fr/inria-00073298

T. Milanez, S. Collange, F. M. Pereira, W. Meira, and R. Ferreira, Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads, Parallel Computing, vol.40, issue.9, pp.548-558, 2014.
DOI : 10.1016/j.parco.2014.03.006

URL : https://hal.archives-ouvertes.fr/hal-01087054

R. M. Russell, The CRAY-1 computer system, Communications of the ACM, vol.21, issue.1, pp.63-72, 1978.
DOI : 10.1145/359327.359336

R. Karrenberg and S. Hack, Whole-function vectorization, Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pp.141-150, 2011.
DOI : 10.1109/cgo.2011.5764682

URL : http://www.intel-vci.uni-saarland.de/uploads/tx_sibibtex/10.pdf

Y. Lee, V. Grover, R. Krashinsky, M. Stephenson, S. W. Keckler et al., Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp.101-113, 2014.
DOI : 10.1109/MICRO.2014.48

Y. Lee, A. Waterman, R. Avizienis, H. Cook, C. Sun et al., A 45nm 1.3 GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators, European Solid State Circuits Conference, 2014.

J. Nickolls and W. J. Dally, The GPU Computing Era, IEEE Micro, vol.30, issue.2, pp.56-69, 2010.
DOI : 10.1109/MM.2010.41

G. Diamos, A. Kerr, H. Wu, S. Yalamanchili, B. Ashbaugh et al., SIMD re-convergence at thread frontiers, Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44 '11, 2011.
DOI : 10.1145/2155620.2155676

J. Menon, M. de Kruijf, and K. Sankaralingam, iGPU, ACM SIGARCH Computer Architecture News, vol.40, issue.3, pp.72-83, 2012.
DOI : 10.1145/2366231.2337168

W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt, Dynamic warp formation, ACM Transactions on Architecture and Code Optimization, vol.6, issue.2, pp.1-7, 2009.
DOI : 10.1145/1543753.1543756

N. Brunie, S. Collange, and G. Diamos, Simultaneous branch and warp interweaving for sustained GPU performance, ACM SIGARCH Computer Architecture News, vol.40, issue.3, pp.49-60, 2012.
DOI : 10.1145/2366231.2337166

URL : https://hal.archives-ouvertes.fr/ensl-00649650

A. Lashgar, A. Khonsari, and A. Baniasadi, HARP, ACM Transactions on Embedded Computing Systems, vol.13, issue.3s, pp.13-16, 2014.
DOI : 10.1007/s02011-011-1137-8

D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo et al., Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor, Proceedings of the 23rd Annual International Symposium on Computer Architecture, pp.191-202, 1996.

F. J. Cazorla, A. Ramírez, M. Valero, and E. Fernández, Dynamically Controlled Resource Allocation in SMT Processors, 37th International Symposium on Microarchitecture (MICRO-37'04), pp.171-182, 2004.
DOI : 10.1109/MICRO.2004.17

A. El-Moursy and D. H. Albonesi, Front-end policies for improved issue efficiency in SMT processors, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings., pp.31-40, 2003.
DOI : 10.1109/HPCA.2003.1183522

S. Eyerman and L. Eeckhout, A memory-level parallelism aware fetch policy for SMT processors, 13th International Conference on High-Performance Computer Architecture, pp.240-249, 2007.

M. J. Quinn, P. J. Hatcher, and K. C. Jourdenais, Compiling C* programs for a hypercube multicomputer, ACM SIGPLAN Notices, vol.23, issue.9, pp.57-65, 1988.
DOI : 10.1145/62116.62122

S. Collange, Stack-less SIMT reconvergence at low cost, Tech. rep., HAL, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00622654

R. Kumar, N. P. Jouppi, and D. M. Tullsen, Conjoined-Core Chip Multiprocessing, 37th International Symposium on Microarchitecture (MICRO-37'04), pp.195-206, 2004.
DOI : 10.1109/MICRO.2004.12

J. González, Q. Cai, P. Chaparro, G. Magklis, R. Rakvic et al., Thread fusion, Proceeding of the thirteenth international symposium on Low power electronics and design, ISLPED '08, pp.363-368, 2008.
DOI : 10.1145/1393921.1394018

G. Long, D. Franklin, S. Biswas, P. Ortiz, J. Oberg et al., Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pp.337-348, 2010.
DOI : 10.1109/MICRO.2010.41

M. Dechene, E. Forbes, and E. Rotenberg, Multithreaded instruction sharing

M. McKeown, J. Balkind, and D. Wentzlaff, Execution Drafting: Energy Efficiency through Computation Deduplication, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp.432-444, 2014.
DOI : 10.1109/MICRO.2014.43

M. J. Flynn, Some Computer Organizations and Their Effectiveness, IEEE Transactions on Computers, vol.21, issue.9, pp.948-960, 1972.
DOI : 10.1109/TC.1972.5009071

J. Meng, D. Tarjan, and K. Skadron, Dynamic warp subdivision for integrated branch and memory divergence tolerance, ACM SIGARCH Computer Architecture News, vol.38, issue.3, pp.235-246, 2010.
DOI : 10.1145/1816038.1815992

C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser et al., Pin, ACM SIGPLAN Notices, vol.40, issue.6, pp.190-200, 2005.
DOI : 10.1145/1064978.1065034

A. Seznec, A new case for the TAGE branch predictor, Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44 '11, pp.117-127, 2011.
DOI : 10.1145/2155620.2155635

URL : https://hal.archives-ouvertes.fr/hal-00639193

J. Meng and K. Skadron, Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling, 2009 IEEE International Conference on Computer Design, pp.282-288, 2009.
DOI : 10.1109/ICCD.2009.5413143

M. O'Connor, Highlights of the High-Bandwidth Memory (HBM) standard, Memory Forum Workshop, 2014.

S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen et al., McPAT, Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-42, pp.469-480, 2009.
DOI : 10.1145/1669112.1669172

S. L. Xi, H. M. Jacobson, P. Bose, G. Wei, and D. M. Brooks, Quantifying sources of error in McPAT and potential impacts on architectural studies, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp.577-589, 2015.
DOI : 10.1109/HPCA.2015.7056064

S. Collange, D. Defour, and Y. Zhang, Dynamic Detection of Uniform and Affine Vectors in GPGPU Computations, Europar 3rd Workshop on Highly Parallel Processing on a Chip (HPPC), pp.46-55, 2009.
DOI : 10.1007/978-3-642-14122-5_8

URL : https://hal.archives-ouvertes.fr/hal-00396719