A. Anbar, O. Serres, E. Kayraklioglu, A. A. Badawy, and T. El-Ghazawi, Exploiting Hierarchical Locality in Deep Parallel Architectures, ACM Transactions on Architecture and Code Optimization (TACO), vol.13, issue.2, pp.1-25, 2016.

R. Azimi, D. K. Tam, L. Soares, and M. Stumm, Enhancing Operating System Support for Multicore Processors by Using Hardware Performance Monitoring, ACM SIGOPS Operating Systems Review, vol.43, issue.2, pp.56-65, 2009.

D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter et al., The NAS Parallel Benchmarks, International Journal of Supercomputer Applications, vol.5, issue.3, pp.66-73, 1991.

S. Bak, H. Menon, S. White, M. Diener, and L. Kale, Multi-level load balancing with an integrated runtime approach, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2018.

G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight et al., Communication lower bounds and optimal algorithms for numerical linear algebra, Acta Numerica, vol.23, pp.1-155, 2014.

N. Barrow-Williams, C. Fensch, and S. Moore, A Communication Characterisation of Splash-2 and Parsec, IEEE International Symposium on Workload Characterization (IISWC), pp.86-97, 2009.

C. Bienia, S. Kumar, J. P. Singh, and K. Li, The PARSEC Benchmark Suite: Characterization and Architectural Implications, International Conference on Parallel Architectures and Compilation Techniques (PACT), pp.72-81, 2008.

J. E. Boillat and P. G. Kropf, A Fast Distributed Mapping Algorithm, Joint International Conference on Vector and Parallel Processing (CONPAR 90 - VAPP IV), pp.405-416, 1990.

S. Bokhari, On the Mapping Problem, IEEE Transactions on Computers, vol.C-30, issue.3, pp.207-214, 1981.

B. Brandfass, T. Alrutz, and T. Gerhold, Rank reordering for MPI communication optimization, Computers & Fluids, vol.80, pp.372-380, 2013.

F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin et al., hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications, Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP), pp.180-186, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00429889

H. Chen, W. Chen, J. Huang, B. Robert, and H. Kuhn, MPIPP: An Automatic Profile-guided Parallel Process Placement Toolset for SMP Clusters and Multiclusters, ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp.353-360, 2006.

C. Chevalier and F. Pellegrini, PT-Scotch: A Tool for Efficient Parallel Graph Ordering, Parallel Computing, vol.34, issue.6-8, pp.318-331, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00410427

E. H. Cruz, M. Diener, M. A. Alves, and P. O. Navaux, Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols, Journal of Parallel and Distributed Computing (JPDC), vol.74, issue.3, pp.2215-2228, 2014.

E. H. Cruz, M. Diener, L. L. Pilla, and P. O. Navaux, An Efficient Algorithm for Communication-Based Task Mapping, International Conference on Parallel, Distributed, and Network-Based Processing (PDP), pp.207-214, 2015.

E. H. Cruz, M. Diener, L. L. Pilla, and P. O. Navaux, Hardware-assisted thread and data mapping in hierarchical multicore architectures, ACM Transactions on Architecture and Code Optimization (TACO), vol.13, issue.3, 2016.

E. H. Cruz, M. Diener, M. S. Serpa, P. O. Navaux, L. Pilla et al., Improving communication and load balancing with thread mapping in manycore systems, Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp.93-100, 2018.

W. J. Dally, GPU Computing to Exascale and Beyond, 2010.

M. Deveci, K. Kaya, B. Ucar, and U. V. Catalyurek, Fast and High Quality Topology-Aware Task Mapping, IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.197-206, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01159677

K. D. Devine, E. G. Boman, R. T. Heaphy, R. H. Bisseling, and U. V. Catalyurek, Parallel hypergraph partitioning for scientific computing, IEEE International Parallel & Distributed Processing Symposium (IPDPS), pp.124-133, 2006.

J. Edmonds, Maximum matching and a polyhedron with 0,1-vertices, Journal of Research of the National Bureau of Standards, Section B: Mathematics and Mathematical Physics, vol.69B, issue.1-2, p.125, 1965.

R. Glantz, H. Meyerhenke, and A. Noe, Algorithms for Mapping Parallel Processes onto Grid and Torus Architectures, 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pp.236-243, 2015.

B. Hendrickson and R. Leland, The Chaco User's Guide: Version 2.0, 1995.

S. Ito, K. Goto, and K. Ono, Automatically optimized core mapping to subdomains of domain decomposition method on multicore parallel environments, Computers & Fluids, vol.80, pp.88-93, 2013.

E. Jeannot and G. Mercier, Near-optimal placement of MPI processes on hierarchical NUMA architectures, Euro-Par Parallel Processing, pp.199-210, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00544346

E. Jeannot, G. Mercier, and F. Tessier, Process Placement in Multicore Clusters: Algorithmic Issues and Practical Techniques, IEEE Transactions on Parallel and Distributed Systems, vol.25, issue.4, pp.993-1002, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00803548

E. Jeannot, G. Mercier, and F. Tessier, Topology and Affinity Aware Hierarchical and Distributed Load-balancing in Charm++, Workshop on Optimization of Communication in HPC (COM-HPC), pp.63-72, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01394748

H. Jin, M. Frumkin, and J. Yan, The OpenMP implementation of NAS Parallel Benchmarks and Its Performance, 1999.

M. T. Jones, Inside the Linux 2.6 Completely Fair Scheduler, 2009.

Z. Jovanovic and S. Maric, A heuristic algorithm for dynamic task scheduling in highly parallel computing systems, Future Generation Computer Systems, vol.17, issue.6, pp.721-732, 2001.

G. Karypis and V. Kumar, METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 2.0, 1995.

G. Karypis and V. Kumar, Parallel Multilevel K-way Partitioning Scheme for Irregular Graphs, ACM/IEEE Conference on Supercomputing, pp.1-21, 1996.

G. Karypis and V. Kumar, A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, SIAM Journal on Scientific Computing, vol.20, issue.1, pp.359-392, 1998.

C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser et al., Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp.190-200, 2005.

H. Luo, P. Li, and C. Ding, Thread data sharing in cache: Theory and measurement, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp.103-115, 2017.

P. Luszczek, J. J. Dongarra, D. Koester, R. Rabenseifner, B. Lucas et al., Introduction to the HPC Challenge Benchmark Suite, 2005.

L. McVoy and C. Staelin, Lmbench: Portable Tools for Performance Analysis, USENIX Annual Technical Conference (ATC), pp.23-38, 1996.

F. Pellegrini, Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs, Scalable High-Performance Computing Conference (SHPCC), pp.486-493, 1994.

J. Shalf, S. Dosanjh, and J. Morrison, Exascale Computing Technology Challenges, High Performance Computing for Computational Science (VECPAR), pp.1-25, 2010.

J. Sonnek, J. Greensky, R. Reutiman, and A. Chandra, Starling: Minimizing Communication Overhead in Virtualized Computing Platforms Using Decentralized Affinity-Aware Migration, International Conference on Parallel Processing (ICPP), pp.228-237, 2010.

F. Trahay, F. Rue, M. Faverge, Y. Ishikawa, R. Namyst et al., EZTrace: a generic framework for performance analysis, International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp.618-619, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00587216

D. Unat, A. Dubey, T. Hoefler, J. Shalf, M. Abraham et al., Trends in Data Locality Abstractions for HPC Systems, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol.28, issue.10, pp.3007-3020, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01621371

W. Wang, T. Dey, J. Mars, L. Tang, J. W. Davidson et al., Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources, IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), 2012.

J. Zhai, T. Sheng, and J. He, Efficiently Acquiring Communication Traces for Large-Scale Parallel Applications, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol.22, pp.1862-1870, 2011.