R. V. van Nieuwpoort, J. Maassen et al., Adaptive load-balancing for divide-and-conquer grid applications, J. of Supercomputing, 2004.

B. Ackland, A. Anesko, D. Brinthaupt, S. J. Daubert, A. Kalavade et al., A single-chip, 1.6-billion, 16-b MAC/s multiprocessor DSP, IEEE Journal of Solid-State Circuits, vol.35, issue.3, pp.412-424, 2000.

A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, An evaluation of directory schemes for cache coherence, ACM SIGARCH Computer Architecture News, vol.16, issue.2, pp.280-298, 1988.
DOI : 10.1145/633625.52432

S. N. Agathos, P. E. Hadjidoukas, and V. V. Dimakopoulos, Task-based execution of nested OpenMP loops, OpenMP in a Heterogeneous World, pp.210-222, 2012.

H. Al-zoubi, A. Milenkovic, and M. Milenkovic, Performance evaluation of cache replacement policies for the SPEC CPU2000 benchmark suite, Proceedings of the 42nd annual Southeast regional conference, ACM-SE 42, pp.267-272, 2004.
DOI : 10.1145/986537.986601

E. Allen, D. Chase, J. Hallett, V. Luchangco, J. Maessen et al., The Fortress language specification, Sun Microsystems, vol.139, p.140, 2005.
URL : https://hal.archives-ouvertes.fr/jpa-00217210

G. Almasi, PGAS (partitioned global address space) languages, Encyclopedia of Parallel Computing, pp.1539-1545, 2011.

R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield et al., The Tera computer system, ACM SIGARCH Computer Architecture News, vol.18, issue.3, pp.1-6, 1990.
DOI : 10.1145/255129.255132

J. Archibald and J. Baer, Cache coherence protocols: evaluation using a multiprocessor simulation model, ACM Transactions on Computer Systems, vol.4, issue.4, pp.273-298, 1986.
DOI : 10.1145/6513.6514

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.118.8940

N. S. Arora, R. D. Blumofe, and C. G. Plaxton, Thread scheduling for multiprogrammed multiprocessors, Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pp.119-129, 1998.

J. Baer and W. Wang, On the inclusion properties for multi-level cache hierarchies, Proceedings of the 15th Annual International Symposium on Computer Architecture, ISCA '88, pp.73-80, 1988.

K. Bathe, E. Ramm, and E. L. Wilson, Finite element formulations for large deformation dynamic analysis, International Journal for Numerical Methods in Engineering, vol.9, issue.2, pp.353-386, 1975.
DOI : 10.1002/nme.1620090207

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.208.5272

L. A. Belady, A study of replacement algorithms for a virtual-storage computer, IBM Systems Journal, vol.5, issue.2, pp.78-101, 1966.
DOI : 10.1147/sj.52.0078

M. A. Bender and M. O. Rabin, Online Scheduling of Parallel Programs on Heterogeneous Systems with Applications to Cilk, Theory of Computing Systems, vol.35, issue.3, pp.289-304, 2002.
DOI : 10.1007/s00224-002-1055-5

P. Besl, A case study comparing AoS (arrays of structures) and SoA (structures of arrays) data layouts for a compute-intensive loop run on Intel Xeon processors and Intel Xeon Phi product family coprocessors, 2013.

R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall et al., Cilk: An efficient multithreaded runtime system, Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '95, pp.207-216, 1995.
DOI : 10.1006/jpdc.1996.0107

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3175

R. D. Blumofe and C. E. Leiserson, Scheduling multithreaded computations by work stealing, J. ACM, vol.46, issue.5, pp.720-748, 1999.

T. Brecht, On the importance of parallel application placement in NUMA multiprocessors

G. Breinholt and C. Schierz, Algorithm 781: generating Hilbert's space-filling curve by recursion, ACM Transactions on Mathematical Software, vol.24, issue.2, pp.184-189, 1998.
DOI : 10.1145/290200.290219

F. Broquedis, O. Aumage, B. Goglin, S. Thibault, P. Wacrenier et al., Structuring the execution of OpenMP applications for multicore architectures, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp.1-10, 2010.
DOI : 10.1109/IPDPS.2010.5470442

URL : https://hal.archives-ouvertes.fr/inria-00441472

F. W. Burton and M. R. Sleep, Executing functional programs on a virtual tree of processors, Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture, FPCA '81, pp.187-194, 1981.

A. R. Butz, Convergence with Hilbert's space filling curve, Journal of Computer and System Sciences, vol.3, issue.2, pp.128-146, 1969.

M. Castro, L. G. Fernandes, C. Pousa, J. Mehaut, and M. S. De-aguiar, NUMA-ICTM: A parallel version of ICTM exploiting memory placement strategies for NUMA machines, IEEE International Symposium on Parallel & Distributed Processing, pp.1-8, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00788917

M. Cazenave, Méthode des éléments finis : Approche pratique en mécanique des structures. Dunod, 2010.

D. Chaiken, J. Kubiatowicz, and A. Agarwal, LimitLESS directories, ACM SIGPLAN Notices, vol.26, issue.4, pp.224-234, 1991.
DOI : 10.1145/106973.106995

B. L. Chamberlain, D. Callahan, and H. P. Zima, Parallel Programmability and the Chapel Language, International Journal of High Performance Computing Applications, vol.21, issue.3, pp.291-312, 2007.
DOI : 10.1177/1094342007078442

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.187.7600

D. Chamoret, Modélisation du contact : nouvelles approches numériques, Thèse de doctorat, 2002.

R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. Mcdonald et al., Parallel Programming in OpenMP, 2001.

P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra et al., X10, ACM SIGPLAN Notices, vol.40, issue.10, pp.519-538, 2005.
DOI : 10.1145/1103845.1094852

URL : https://hal.archives-ouvertes.fr/in2p3-00166974

T. M. Chilimbi, B. Davidson, and J. R. Larus, Cache-conscious structure definition, Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pp.13-25, 1999.

P. Clauss and V. Loechner, Parametric analysis of polyhedral iteration spaces, Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, pp.179-194, 1998.

UPC Consortium, UPC language specifications v1.2, 2005.

G. Contreras and M. Martonosi, Characterizing and improving the performance of Intel Threading Building Blocks, 2008 IEEE International Symposium on Workload Characterization, pp.57-66, 2008.
DOI : 10.1109/IISWC.2008.4636091

J. Craveur, Modélisation des éléments finis : Cours et exercices corrigés, 2008.

E. Cuthill and J. McKee, Reducing the bandwidth of sparse symmetric matrices, Proceedings of the 1969 24th national conference, pp.157-172, 1969.
DOI : 10.1145/800195.805928

E. Cuthill, Several Strategies for Reducing the Bandwidth of Matrices, Sparse Matrices and their Applications The IBM Research Symposia Series, pp.157-166, 1972.
DOI : 10.1007/978-1-4615-8675-3_14

L. Dagum and R. Menon, OpenMP: an industry standard API for shared-memory programming, IEEE Computational Science and Engineering, vol.5, issue.1, pp.46-55, 1998.
DOI : 10.1109/99.660313

M. De-wael, S. Marr, B. De-fraine, T. Van-cutsem, and W. D. Meuter, Partitioned Global Address Space Languages, ACM Computing Surveys, vol.47, issue.4, p.29, 2016.
DOI : 10.1145/2716320

URL : https://hal.archives-ouvertes.fr/hal-01109405

V. V. Dimakopoulos, P. E. Hadjidoukas, and G. C. Philos, A microbenchmark study of OpenMP overheads under nested parallelism, OpenMP in a New Era of Parallelism, pp.1-12, 2008.

J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha, Scalable work stealing, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pp.1-53, 2009.
DOI : 10.1145/1654059.1654113

U. S. Dixit and P. M. Dixit, A study on residual stresses in rolling, International Journal of Machine Tools and Manufacture, vol.37, issue.6, pp.837-853, 1997.
DOI : 10.1016/S0890-6955(96)00052-1

L. d'Orazio, F. Jouanot, C. Labbé, and C. Roncancio, Caches sémantiques coopératifs pour la gestion de données sur grilles, 2007.

M. Dubois and F. A. Briggs, Effects of cache coherency in multiprocessors, IEEE Transactions on Computers, vol.31, issue.11, pp.1083-1099, 1982.

S. J. Eggers and R. H. Katz, Evaluating the performance of four snooping cache coherency protocols, ACM SIGARCH Computer Architecture News, vol.17, issue.3, pp.2-15, 1989.
DOI : 10.1145/74926.74927

E. A. Emerson and V. Kahlon, Exact and efficient verification of parameterized cache coherence protocols, Correct Hardware Design and Verification Methods, pp.247-262, 2003.
DOI : 10.1007/978-3-540-39724-3_22

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.101.4629

E. A. Emerson and V. Kahlon, Rapid parameterized model checking of snoopy cache coherence protocols, Tools and Algorithms for the Construction and Analysis of Systems, Lecture Notes in Computer Science, vol.2619, pp.144-159, 2003.

V. Danjean, F. Le Mentec, and T. Gautier, The X-Kaapi's application programming interface. Part I: Data flow programming, 2011.

V. Faucher, Reduction methods for fast transient structural dynamics applicated to the analysis of complex structures under impact
URL : https://hal.archives-ouvertes.fr/tel-01018792

V. Faucher, Advanced parallel strategy for strongly coupled fast transient fluid-structure dynamics with dual management of kinematic constraints, Advances in Engineering Software, pp.70-89, 2014.
DOI : 10.1016/j.advengsoft.2013.08.002

V. Faucher, Numerical methods and parallel algorithms for fast transient strongly coupled fluid-structure dynamics. Habilitation à diriger des recherches, 2014.
URL : https://hal.archives-ouvertes.fr/tel-01011205

W. Feinstein and M. Brylinski, Structure-Based Drug Discovery Accelerated by Many-Core Devices, Current Drug Targets, vol.17, issue.14, 2016.
DOI : 10.2174/1389450117666160112112854

URL : http://doi.org/10.2174/1389450117666160112112854

M. Flynn, Some Computer Organizations and Their Effectiveness, IEEE Transactions on Computers, vol.21, issue.9, pp.948-960, 1972.
DOI : 10.1109/TC.1972.5009071

M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, Cache-oblivious algorithms, Foundations of Computer Science 40th Annual Symposium on, pp.285-297, 1999.

M. Frigo, C. E. Leiserson, and K. H. Randall, The implementation of the Cilk-5 multithreaded language, ACM SIGPLAN Notices, vol.33, issue.5, pp.212-223, 1998.
DOI : 10.1145/277652.277725

A. Fortin and A. Garon, Les éléments finis : de la théorie à la pratique, 2011.

T. Gautier, F. Lementec, V. Faucher, and B. Raffin, X-kaapi: A Multi Paradigm Runtime for Multicore Architectures, 2013 42nd International Conference on Parallel Processing, pp.728-735, 2013.
DOI : 10.1109/ICPP.2013.86

URL : https://hal.archives-ouvertes.fr/hal-00727827

T. Gautier, X. Besseron, and L. Pigeon, KAAPI, Proceedings of the 2007 international workshop on Parallel symbolic computation, PASCO '07, pp.15-23, 2007.
DOI : 10.1145/1278177.1278182

URL : https://hal.archives-ouvertes.fr/hal-00647474

P. Germain, Mécanique, Tome I, École polytechnique, 1986.

M. B. Giles, G. R. Mudalige, B. Spencer, C. Bertolli, and I. Reguly, Designing OP2 for GPU architectures, Journal of Parallel and Distributed Computing, vol.73, issue.11, pp.1451-1460, 2013.
DOI : 10.1016/j.jpdc.2012.07.008

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.259.5159

N. Govender, D. N. Wilke, S. Kok, and R. Els, Development of a convex polyhedral discrete element simulation framework for NVIDIA Kepler based GPUs, Fourth International Conference on Finite Element Methods in Engineering and Sciences, pp.386-400, 2013.
DOI : 10.1016/j.cam.2013.12.032

R. L. Graham, Bounds on Multiprocessing Timing Anomalies, SIAM Journal on Applied Mathematics, vol.17, issue.2, pp.416-429, 1969.
DOI : 10.1137/0117039

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.90.8131

Y. Guo, J. Zhao, V. Cave, and V. Sarkar, SLAW: A scalable locality-aware adaptive work-stealing scheduler for multi-core systems, Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, pp.341-342, 2010.

D. Hendler and N. Shavit, Non-blocking steal-half work queues, Proceedings of the twenty-first annual symposium on Principles of distributed computing, PODC '02, pp.280-289, 2002.
DOI : 10.1145/571825.571876

M. Herlihy and N. Shavit, The art of multiprocessor programming, Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing, PODC '06, 2008.
DOI : 10.1145/1146381.1146382

H. D. Hibbitt, P. V. Marcal, and J. R. Rice, A finite element formulation for problems of large strain and large displacement, International Journal of Solids and Structures, vol.6, issue.8, pp.1069-1086, 1970.
DOI : 10.1016/0020-7683(70)90048-X

P. N. Hilfinger, D. O. Bonachea, K. Datta, D. Gay et al., Titanium language reference manual, version 2.19, 2005.

M. D. Hill and A. J. Smith, Evaluating associativity in CPU caches, IEEE Transactions on Computers, vol.38, issue.12, pp.1612-1630, 1989.
DOI : 10.1109/12.40842

W. Hu, W. Shi, and Z. Tang, JIAJIA: A software DSM system based on a new cache coherence protocol, High-Performance Computing and Networking, pp.461-472, 1999.
DOI : 10.1007/BFb0100607

P. Huerre, Mécanique des fluides Tome 1, 1998.

A. C. Iordan, M. Jahre, and L. Natvig, Tuning the victim selection policy of Intel TBB, Journal of Systems Architecture, 2015.

S.-J. Min, C. Iancu, and K. Yelick, Hierarchical work stealing on manycore clusters, Fifth Conference on Partitioned Global Address Space Programming Models, 2011.

M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee, Optimizing spatial locality in loop nests using linear algebra, Proc. 7th Workshop Compilers for Parallel Computers, p.430, 1998.
DOI : 10.1145/277830.277849

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.27.1030

T. Karcher, C. Schaefer, and V. Pankratius, Auto-tuning support for manycore applications: perspectives for operating systems and compilers
DOI : 10.1145/1531793.1531808

K. Kennedy, C. Koelbel, and H. Zima, The rise and fall of High Performance Fortran, Proceedings of the third ACM SIGPLAN conference on History of programming languages, HOPL III, pp.7-8, 2007.
DOI : 10.1145/1238844.1238851

R. E. Ladner, R. Fortna, and B. Nguyen, A Comparison of Cache Aware and Cache Oblivious Static Search Trees Using Program Instrumentation, 2002.
DOI : 10.1007/3-540-36383-1_4

J. Lee and M. Sato, Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems, 2010 39th International Conference on Parallel Processing Workshops, pp.413-420, 2010.
DOI : 10.1109/ICPPW.2010.62

S. Léger, Méthode Lagrangienne actualisée pour des problèmes hyperélastiques en très grandes déformations, 2014.

A. Legrand and Y. Robert, Algorithmique parallèle : cours et exercices corrigés, Dunod, 2003.

D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, The directory-based cache coherence protocol for the DASH multiprocessor, Proceedings of the 17th Annual International Symposium on Computer Architecture, ISCA '90, pp.148-159, 1990.

W. Li and K. Pingali, A singular loop transformation framework based on non-singular matrices, 1993.

C. Lin and L. Snyder, ZPL: An array sublanguage, Languages and Compilers for Parallel Computing, pp.96-114, 1994.
DOI : 10.1007/3-540-57659-2_6

W. Liu and A. H. Sherman, Comparative Analysis of the Cuthill-McKee and the Reverse Cuthill-McKee Ordering Algorithms for Sparse Matrices, SIAM Journal on Numerical Analysis, vol.13, issue.2, pp.198-213, 1976.
DOI : 10.1137/0713020

D. B. Loveman, High Performance Fortran, IEEE Parallel & Distributed Technology: Systems & Applications, vol.1, issue.1, pp.25-42, 1993.

N. Mahjoubi, Méthode générale de couplage de schéma d'intégration multiéchelles en temps en dynamique des structures, 2010.

M. McCool, J. Reinders, and A. Robison, Structured Parallel Programming: Patterns for Efficient Computation, 2012.

B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz, Analysis of the clustering properties of the Hilbert space-filling curve, IEEE Transactions on Knowledge and Data Engineering, vol.13, issue.1, pp.124-141, 2001.

E. H. Moore, On certain crinkly curves, Transactions of the American Mathematical Society, vol.1, issue.1, pp.72-90, 1900.

G. E. Moore, Cramming More Components Onto Integrated Circuits, Proceedings of the IEEE, pp.82-85, 1998.
DOI : 10.1109/JPROC.1998.658762

S. Moreaud and B. Goglin, Impact of NUMA Effects on High-Speed Networking with Multi-Opteron Machines, PDCS, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00175747

P. Germain and P. Muller, Introduction à la mécanique des milieux continus, 1980.

P. J. Needham, A. Bhuiyan, and R. C. Walker, Extension of the AMBER molecular dynamics software to Intel's Many Integrated Core (MIC) architecture, Computer Physics Communications, vol.201, 2016.
DOI : 10.1016/j.cpc.2015.12.025

N. M. Newmark, A method of computation for structural dynamics, Journal of the Engineering Mechanics Division, vol.85, issue.3, pp.67-94, 1959.

G. Nguyen, Space-filling curves and their application in image processing. Theses, 2013.

D. Novillo, OpenMP and automatic parallelization in GCC, Proceedings of the GCC Developers' Summit, 2006.

R. W. Numrich and J. Reid, Co-array Fortran for parallel programming, SIGPLAN Fortran Forum, vol.17, issue.2, pp.1-31, 1998.

M. Palyart, A Model-Based Approach for the Development of High-Performance Scientific Computing Software. Theses, 2012.
URL : https://hal.archives-ouvertes.fr/tel-00865535

D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach, 1990.

S. Philippe, Development of an Arbitrary Lagrangian Eulerian (ALE) formulation for the 3D simulation of flat rolling, 2009.
URL : https://hal.archives-ouvertes.fr/tel-00431051

L. L. Pilla, C. P. Ribeiro, D. Cordeiro, A. Bhatele et al., Improving parallel system performance with a NUMA-aware load balancer, 2011.

J. Quintin, Dynamic Load-Balancing on Hierarchical Platforms. Theses, 2011.
URL : https://hal.archives-ouvertes.fr/tel-00661447

D. Raffin, History based work-stealing for dynamic numerical simulations, 2011.

K. H. Randall, Cilk: Efficient Multithreaded Computing, 1998.

J. Reinders, Intel Threading Building Blocks, 2007.

C. P. Ribeiro, J. Mehaut, A. Carissimi, M. Castro, and L. G. Fernandes, Memory Affinity for Hierarchical Shared Memory Multiprocessors, 2009 21st International Symposium on Computer Architecture and High Performance Computing, pp.59-66, 2009.
DOI : 10.1109/SBAC-PAD.2009.16

URL : https://hal.archives-ouvertes.fr/hal-00788914

A. Robison, M. Voss, and A. Kukanov, Optimization via Reflection on Work
DOI : 10.1109/ipdps.2008.4536188

H. Sagan, Space-filling curves, 2012.
DOI : 10.1007/978-1-4612-0871-6

I. J. Schoenberg, On the Peano curve of Lebesgue, Bulletin of the American Mathematical Society, vol.44, issue.8, p.519, 1938.
DOI : 10.1090/S0002-9904-1938-06792-4

J. E. Smith and J. R. Goodman, A study of instruction cache organizations and replacement policies, SIGARCH Comput. Archit. News, vol.11, issue.3, pp.132-137, 1983.

W. Stallings, Organisation et architecture de l'ordinateur. Imp. la source d'or, 2003.

C. Stephen, An Introduction to theoretical fluid mechanics, 2000.

G. Strang and G. J. Fix, An Analysis of the Finite-Element Method, Journal of Applied Mechanics, vol.41, issue.1, 1973.
DOI : 10.1115/1.3423272

T. Suh, D. M. Blough, and H.-H. S. Lee, Supporting cache coherence in heterogeneous multiprocessor systems, Proceedings Design, Automation and Test in Europe Conference and Exhibition, 2003.
DOI : 10.1109/DATE.2004.1269047

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.110.1306

V. S. Sunderam, PVM: A framework for parallel distributed computing, Concurrency: Practice and Experience, vol.2, issue.4, pp.315-339, 1990.
DOI : 10.1002/cpe.4330020404

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.2880

Y. Tanaka, K. Taura, M. Sato, and A. Yonezawa, Performance Evaluation of OpenMP Applications with Nested Parallelism, Languages, Compilers, and Run-Time Systems for Scalable Computers, Lecture Notes in Computer Science, vol.1915, pp.100-112, 2000.
DOI : 10.1007/3-540-40889-4_8

M. Tchiboukdjian, Algorithmes parallèles efficaces en cache : application à la visualisation scientifique, 2010.

V. Faucher, B. Raffin, T. Gautier, and F. Lementec, X-kaapi: a multi paradigm runtime for multicore architectures

X. Tian, J. P. Hoeflinger, G. Haab, Y. Chen, M. Girkar et al., A compiler for exploiting nested parallelism in OpenMP programs, Parallel Computing, vol.31, issue.10-12, p.960, 2005.
DOI : 10.1016/j.parco.2005.03.007

A. Tousimojarad and W. Vanderbauwhede, Steal Locally, Share Globally, International Journal of Parallel Programming, vol.18, issue.4, pp.894-917, 2015.
DOI : 10.1007/s10766-015-0350-0

D. Traoré, Self-adaptive parallel algorithms and applications. Theses, Institut National Polytechnique de Grenoble - INPG, 2008.

J. Treibig, G. Hager, and G. Wellein, LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments, 2010 39th International Conference on Parallel Processing Workshops, pp.207-216, 2010.
DOI : 10.1109/ICPPW.2010.38

URL : http://arxiv.org/abs/1004.4431

J. von Neumann, Introduction to the first draft report on the EDVAC, 1945.

I. Wald, Fast construction of SAH BVHs on the Intel Many Integrated Core (MIC) architecture, IEEE Transactions on Visualization and Computer Graphics, vol.18, issue.1, pp.47-57, 2012.

D. W. Walker and J. J. Dongarra, MPI: A standard message passing interface, pp.56-68, 1996.

M. V. Wilkes, Slave memories and dynamic storage allocation, IEEE Transactions on Electronic Computers, vol.14, issue.2, pp.270-271, 1965.
DOI : 10.1109/pgec.1965.264263

M. E. Wolf and M. S. Lam, A data locality optimizing algorithm, pp.30-44, 1991.

M. Wolfe, High Performance Compilers for Parallel Computing, 1995.

I. Wu and H. T. Kung, Communication complexity for parallel divide-and-conquer, Proceedings of the 32nd Annual Symposium on Foundations of Computer Science, pp.151-162, 1991.
DOI : 10.1109/sfcs.1991.185364

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.6209