Exploiting Hierarchical Locality in Deep Parallel Architectures, ACM Transactions on Architecture and Code Optimization (TACO), vol.13, issue.2, pp.1-25, 2016. ,
Enhancing Operating System Support for Multicore Processors by Using Hardware Performance Monitoring, ACM SIGOPS Operating Systems Review, vol.43, issue.2, pp.56-65, 2009. ,
The NAS Parallel Benchmarks, International Journal of Supercomputer Applications, vol.5, issue.3, pp.66-73, 1991. ,
Multi-level load balancing with an integrated runtime approach, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2018. ,
Communication lower bounds and optimal algorithms for numerical linear algebra, Acta Numerica, vol.23, pp.1-155, 2014. ,
A Communication Characterisation of Splash-2 and Parsec, IEEE International Symposium on Workload Characterization (IISWC), pp.86-97, 2009. ,
The PARSEC Benchmark Suite: Characterization and Architectural Implications, International Conference on Parallel Architectures and Compilation Techniques (PACT), pp.72-81, 2008. ,
A Fast Distributed Mapping Algorithm, Joint International Conference on Vector and Parallel Processing (CONPAR 90 - VAPP IV), pp.405-416, 1990. ,
On the Mapping Problem, IEEE Transactions on Computers, vol.C-30, issue.3, pp.207-214, 1981. ,
Rank reordering for MPI communication optimization, Computers & Fluids, vol.80, pp.372-380, 2013. ,
hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications, Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP), pp.180-186, 2010. ,
URL : https://hal.archives-ouvertes.fr/inria-00429889
MPIPP: An Automatic Profile-guided Parallel Process Placement Toolset for SMP Clusters and Multiclusters, ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp.353-360, 2006. ,
PT-Scotch: A Tool for Efficient Parallel Graph Ordering, Parallel Computing, vol.34, issue.6-8, pp.318-331, 2008. ,
URL : https://hal.archives-ouvertes.fr/hal-00410427
Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols, Journal of Parallel and Distributed Computing (JPDC), vol.74, issue.3, pp.2215-2228, 2014. ,
An Efficient Algorithm for Communication-Based Task Mapping, International Conference on Parallel, Distributed, and Network-Based Processing (PDP), pp.207-214, 2015. ,
Hardware-assisted thread and data mapping in hierarchical multicore architectures, ACM Transactions on Architecture and Code Optimization (TACO), vol.13, issue.3, 2016. ,
Improving communication and load balancing with thread mapping in manycore systems, Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp.93-100, 2018. ,
GPU Computing to Exascale and Beyond, 2010. ,
Fast and High Quality Topology-Aware Task Mapping, IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.197-206, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01159677
Parallel hypergraph partitioning for scientific computing, IEEE International Parallel & Distributed Processing Symposium (IPDPS), pp.124-133, 2006. ,
Maximum matching and a polyhedron with 0,1-vertices, Journal of Research of the National Bureau of Standards, Section B: Mathematics and Mathematical Physics, vol.69B, issue.1-2, p.125, 1965. ,
Algorithms for Mapping Parallel Processes onto Grid and Torus Architectures, 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pp.236-243, 2015. ,
The Chaco User's Guide: Version 2.0, 1995. ,
Automatically optimized core mapping to subdomains of domain decomposition method on multicore parallel environments, Computers & Fluids, vol.80, pp.88-93, 2013. ,
Near-optimal placement of MPI processes on hierarchical NUMA architectures, Euro-Par Parallel Processing, pp.199-210, 2010. ,
URL : https://hal.archives-ouvertes.fr/inria-00544346
Process Placement in Multicore Clusters: Algorithmic Issues and Practical Techniques, IEEE Transactions on Parallel and Distributed Systems, vol.25, issue.4, pp.993-1002, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-00803548
Topology and Affinity Aware Hierarchical and Distributed Load-balancing in Charm++, Workshop on Optimization of Communication in HPC (COM-HPC), pp.63-72, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01394748
The OpenMP implementation of NAS Parallel Benchmarks and Its Performance, 1999. ,
Inside the Linux 2.6 Completely Fair Scheduler, 2009. ,
A heuristic algorithm for dynamic task scheduling in highly parallel computing systems, Future Generation Computer Systems, vol.17, issue.6, pp.721-732, 2001. ,
METIS - Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 2.0, 1995. ,
Parallel Multilevel K-way Partitioning Scheme for Irregular Graphs, ACM/IEEE Conference on Supercomputing, pp.1-21, 1996. ,
A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, SIAM Journal on Scientific Computing, vol.20, issue.1, pp.359-392, 1998. ,
Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp.190-200, 2005. ,
Thread data sharing in cache: Theory and measurement, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp.103-115, 2017. ,
Introduction to the HPC Challenge Benchmark Suite, 2005. ,
Lmbench: Portable Tools for Performance Analysis, USENIX Annual Technical Conference (ATC), pp.23-38, 1996. ,
Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs, Scalable High-Performance Computing Conference (SHPCC), pp.486-493, 1994. ,
Exascale Computing Technology Challenges, High Performance Computing for Computational Science (VECPAR), pp.1-25, 2010. ,
Starling: Minimizing Communication Overhead in Virtualized Computing Platforms Using Decentralized Affinity-Aware Migration, International Conference on Parallel Processing (ICPP), pp.228-237, 2010. ,
EZTrace: a generic framework for performance analysis, International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp.618-619, 2011. ,
URL : https://hal.archives-ouvertes.fr/inria-00587216
Trends in Data Locality Abstractions for HPC Systems, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol.28, issue.10, pp.3007-3020, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01621371
Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources, IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), 2012. ,
Efficiently Acquiring Communication Traces for Large-Scale Parallel Applications, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol.22, pp.1862-1870, 2011. ,