Clock rate versus IPC: the end of the road for conventional microarchitectures, Proceedings of the 27th International Symposium on Computer Architecture, pp.248-259, 2000. ,
Task-Based FMM for Multicore Architectures, SIAM Journal on Scientific Computing, vol.36, issue.1, pp.66-93, 2014. ,
DOI : 10.1137/130915662
URL : https://hal.archives-ouvertes.fr/hal-00807368
Abstract Machine Models and Proxy Architectures for Exascale Computing, 2014 Hardware-Software Co-Design for High Performance Computing, 2014. ,
DOI : 10.1109/Co-HPC.2014.4
The PERCS High-Performance Interconnect, 2010 18th IEEE Symposium on High Performance Interconnects, 2010. ,
DOI : 10.1109/HOTI.2010.16
StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, vol.23, issue.4, pp.187-198, 2011. ,
DOI : 10.1002/cpe.1631
URL : https://hal.archives-ouvertes.fr/inria-00384363
The COSMO priority project 'conservative dynamical core' final report, 2013. ,
The OmpSs Programming Model ,
Cilk: An Efficient Multithreaded Runtime System, Proceedings of PPoPP '95, 1995. ,
DOI : 10.1006/jpdc.1996.0107
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3175
Zoltan2: Next generation combinatorial toolkit, 2012. ,
Thousand core chips, Proceedings of the 44th annual conference on Design automation, DAC '07, pp.746-749, 2007. ,
DOI : 10.1145/1278480.1278667
The future of microprocessors, Communications of the ACM, vol.54, issue.5, pp.67-77, 2011. ,
DOI : 10.1145/1941487.1941507
Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp.1432-1441, 2011. ,
DOI : 10.1109/IPDPS.2011.299
PaRSEC: Exploiting Heterogeneity to Enhance Scalability, Computing in Science & Engineering, vol.15, issue.6, pp.36-45, 2013. ,
DOI : 10.1109/MCSE.2013.98
The Lustre storage architecture, 2003. ,
Effects of Flow Instabilities on the Linear Analysis of Turbomachinery Aeroelasticity, Journal of Propulsion and Power, vol.19, issue.2, pp.250-259, 2014. ,
DOI : 10.2514/2.6106
A batch scheduler with high level components, CCGrid 2005, IEEE International Symposium on Cluster Computing and the Grid, pp.776-783, 2005. ,
DOI : 10.1109/CCGRID.2005.1558641
URL : https://hal.archives-ouvertes.fr/hal-00005106
PVFS: A parallel file system for Linux clusters, Proceedings of the 4th Annual Linux Showcase and Conference, pp.317-327, 2000. ,
SEJITS: Getting productivity and performance with selective embedded JIT specialization, 2009. ,
Copperhead: compiling an embedded data parallel language, Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2011, pp.47-56, 2011. ,
SuperMatrix, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, PPoPP '08, 2008. ,
DOI : 10.1145/1345206.1345227
Pipelining Computational Stages of the Tomographic Reconstructor for Multi-Object Adaptive Optics on a Multi-GPU System, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, 2014. ,
DOI : 10.1109/SC.2014.27
X10, ACM SIGPLAN Notices, vol.40, issue.10, pp.519-538, 2005. ,
DOI : 10.1145/1103845.1094852
URL : https://hal.archives-ouvertes.fr/in2p3-00166974
MPIPP, Proceedings of the 20th annual international conference on Supercomputing, ICS '06, pp.353-360, 2006. ,
DOI : 10.1145/1183401.1183451
Scheduling threads for constructive cache sharing on CMPs, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, SPAA '07, 2007. ,
DOI : 10.1145/1248377.1248396
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.146.3374
Chombo software package for AMR applications, 2000. ,
OpenMP 4.0 Application Program Interface, 2013. ,
Accelerating CyberShake Calculations on the XE6/XK7 Platform of Blue Waters, 2013 Extreme Scaling Workshop (XSW 2013), pp.8-17, 2013. ,
DOI : 10.1109/XSW.2013.6
Exploiting Geometric Partitioning in Task Mapping for Parallel Computers, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014. ,
DOI : 10.1109/IPDPS.2014.15
Hybrid Static/dynamic Scheduling for Already Optimized Dense Matrix Factorization, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp.496-507, 2012. ,
DOI : 10.1109/IPDPS.2012.53
URL : https://hal.archives-ouvertes.fr/inria-00631348
CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014. ,
DOI : 10.1109/IPDPS.2014.27
URL : https://hal.archives-ouvertes.fr/hal-00916091
Kokkos: Enabling manycore performance portability through polymorphic memory access patterns, Journal of Parallel and Distributed Computing, 2014. ,
SugarJ, ACM SIGPLAN Notices, vol.46, issue.10, pp.391-406, 2011. ,
DOI : 10.1145/2076021.2048099
Scotch and LibScotch 5.1 User's Guide. ScAlApplix project, 2008. ,
URL : https://hal.archives-ouvertes.fr/hal-00410332
Grid tools: Towards a library for hardware oblivious implementation of stencil based codes ,
DASH: Data Structures and Algorithms with Support for Hierarchical Locality, Euro-Par Workshops, 2014. ,
DOI : 10.1007/978-3-319-14313-2_46
Designing a unified programming model for heterogeneous machines, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012. ,
DOI : 10.1109/SC.2012.48
The Scalasca performance toolset architecture, Concurrency and Computation: Practice and Experience, pp.702-719, 2010. ,
Managing the topology of heterogeneous cluster nodes with hardware locality (hwloc), 2014 International Conference on High Performance Computing & Simulation (HPCS), 2014. ,
DOI : 10.1109/HPCSim.2014.6903671
URL : https://hal.archives-ouvertes.fr/hal-00985096
Netloc: Towards a Comprehensive View of the HPC System Topology, 2014 43rd International Conference on Parallel Processing Workshops, 2014. ,
DOI : 10.1109/ICPPW.2014.38
URL : https://hal.archives-ouvertes.fr/hal-01010599
Job scheduling under the portable batch system, Job scheduling strategies for parallel processing, pp.279-294, 1995. ,
GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation, Journal of Chemical Theory and Computation, vol.4, issue.3, pp.435-447, 2008. ,
DOI : 10.1021/ct700301q
Chapter 5: An overview of process mapping techniques and algorithms in high-performance computing, High Performance Computing on Complex Environments, pp.65-84, 2014. ,
Generic topology mapping strategies for large-scale parallel architectures, Proceedings of the international conference on Supercomputing, ICS '11, pp.75-84, 2011. ,
DOI : 10.1145/1995896.1995909
Building domain-specific embedded languages, ACM Computing Surveys, vol.28, 1996. ,
hwloc. Portable Hardware Locality ,
Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures, Euro-Par 2010 - Parallel Processing, 16th International Euro-Par Conference, pp.199-210, 2010. ,
DOI : 10.1007/978-3-642-15291-7_20
URL : https://hal.archives-ouvertes.fr/inria-00544346
Process Placement in Multicore Clusters: Algorithmic Issues and Practical Techniques, IEEE Transactions on Parallel and Distributed Systems, vol.25, issue.4, pp.993-1002, 2014. ,
DOI : 10.1109/TPDS.2013.104
URL : https://hal.archives-ouvertes.fr/hal-00803548
Charm++: A portable concurrent object oriented system based on C++, Proceedings of the Eighth Annual Conference on Object-oriented Programming Systems, Languages, and Applications, OOPSLA '93, pp.91-108, 1993. ,
Hierarchical Computation in the SPMD Programming Model, The 26th International Workshop on Languages and Compilers for Parallel Computing, 2013. ,
DOI : 10.1007/978-3-319-09967-5_1
ParMETIS. Parallel graph partitioning and sparse matrix ordering library, 2003. ,
QUARK Users' Guide: QUeueing And Runtime for Kernels, 2011. ,
Technology-driven, highly-scalable dragonfly topology, 35th International Symposium on Computer Architecture, ISCA '08, pp.77-88, 2008. ,
DOI : 10.1109/isca.2008.19
PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation, Parallel Computing, vol.38, issue.3, pp.157-174, 2012. ,
DOI : 10.1016/j.parco.2011.09.001
ExaScale computing study: Technology challenges in achieving exascale systems, 2008. ,
Exascale computing trends: Adjusting to the "new normal" for computer architecture, Computing in Science and Engineering, vol.15, issue.6, pp.16-26, 2013. ,
Is petascale completely done? What should we do now? Joint-Lab on Petascale Computing Workshop, https ,
Using Compiler Directives to Port Large Scientific Applications to GPUs: An Example from Atmospheric Science, Parallel Processing Letters, vol.24, issue.01, 2014. ,
DOI : 10.1142/S0129626414500030
Parallel netCDF, Proceedings of the 2003 ACM/IEEE conference on Supercomputing, SC '03, 2003. ,
DOI : 10.1145/1048935.1050189
Data-Driven Execution of Fast Multipole Methods. CoRR, abs, 1203. ,
Generating device-specific GPU code for local operators in medical imaging, Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.569-581, 2012. ,
Towards Domain-Specific Computing for Stencil Codes in HPC, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp.1133-1138, 2012. ,
DOI : 10.1109/SC.Companion.2012.136
Scalable large-scale fluid-structure interaction solvers in the Uintah framework via hybrid task-based parallelism algorithms, Concurrency and Computation: Practice and Experience, vol.90, issue.3, pp.1388-1407, 2014. ,
DOI : 10.1002/cpe.3099
Design and implementation of a customizable work stealing scheduler, Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers, ROSS '13, 2013. ,
DOI : 10.1145/2491661.2481433
OpenMP task scheduling strategies for multicore NUMA systems, International Journal of High Performance Computing Applications, vol.26, issue.2, pp.110-124, 2012. ,
OpenMP Application Program Interface ,
STELLA: A domain-specific language for stencil methods on structured grids, Poster Presentation at the Platform for Advanced Scientific Computing (PASC) Conference ,
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS, Lecture Notes in Computer Science, in press, 2014. ,
DOI : 10.1007/978-3-319-15976-8_1
Scalable analysis of multicore data reuse and sharing, Proceedings of the 28th ACM international conference on Supercomputing, ICS '14, 2014. ,
DOI : 10.1145/2597652.2597674
Efficient Task Placement and Routing in Dragonfly Networks, Proceedings of the 23rd ACM International Symposium on High-Performance Parallel and Distributed Computing, 2014. ,
GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit, Bioinformatics, vol.29, issue.7, pp.845-854, 2013. ,
Modeling communication in cache-coherent SMP systems, Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, HPDC '13, pp.97-108, 2013. ,
DOI : 10.1145/2493123.2462916
PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp.1116-1123, 2012. ,
DOI : 10.1109/SC.Companion.2012.134
Multicore Aware Process Mapping and its Impact on Communication Overhead of Parallel Applications, Proceedings of the IEEE Symp. on Comp. and Comm, pp.811-817, 2009. ,
Lightweight modular staging, Communications of the ACM, vol.55, issue.6, pp.121-130, 2012. ,
DOI : 10.1145/2184319.2184345
DESOLA: An active linear algebra library using delayed evaluation and runtime code generation, Science of Computer Programming, vol.76, issue.4, pp.227-242, 2011. ,
DOI : 10.1016/j.scico.2008.06.002
GPFS: A shared-disk file system for large computing clusters, First USENIX Conference on File and Storage Technologies (FAST'02), 2002. ,
Exascale Computing Technology Challenges, International Meeting on High Performance Computing for Computational Science, pp.1-25, 2010. ,
DOI : 10.1109/MM.2009.5
Large scale system monitoring and analysis on Blue Waters using OVIS, Proceedings of the 2014 Cray User's Group, 2014. ,
On implementing MPI-IO portably and with high performance, Proceedings of the sixth workshop on I/O in parallel and distributed systems, IOPADS '99, pp.23-32, 1999. ,
DOI : 10.1145/301816.301826
Tiling as a durable abstraction for parallelism and data locality, Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, 2013. ,
The libflame Library for Dense Matrix Computations, IEEE Des. Test, vol.11, issue.6, pp.56-63, 2009. ,
Active libraries: Rethinking the roles of compilers and libraries. CoRR, math, 1998. ,
Scalable performance of the Panasas parallel file system, Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST), pp.17-33, 2008. ,
Jesper Larsson Träff, and Philippas Tsigas. Work-stealing with configurable scheduling strategies, Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pp.315-316, 2013. ,
Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement, Proceedings of the 22nd International Workshop on Languages and Compilers for Parallel Computing, 2009. ,
DOI : 10.1007/978-3-642-13374-9_12
Titanium: A high-performance Java dialect, Workshop on Java for High-Performance Network Computing, 1998. ,
Applying Loop Optimizations to Object-Oriented Abstractions Through General Classification of Array Semantics, Lecture Notes in Computer Science, vol.3602, pp.253-267, 2004. ,
DOI : 10.1007/11532378_19
SLURM: Simple Linux Utility for Resource Management, Job Scheduling Strategies for Parallel Processing, pp.44-60, 2003. ,
UPC++: A PGAS Extension for C++, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014. ,
DOI : 10.1109/IPDPS.2014.115
LSF: Load sharing in large heterogeneous distributed systems, I Workshop on Cluster Computing, 1992. ,