W. Feng and T. Scogland, Green500 list, URL: https://www.green500.org/lists, 2017.
DOI : 10.1109/ipdpsw.2010.5470905

K Computer, URL: http://www.aics.riken, 2017.

A. Agelastos, B. A. Allan, and J. M. Brandt, The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.154-165, 2014.
DOI : 10.1109/SC.2014.18

E. Agullo, C. Augonnet, and J. Dongarra, LU factorization for accelerator-based systems, 2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), pp.217-224, 2011.
DOI : 10.1109/AICCSA.2011.6126599

URL : https://hal.archives-ouvertes.fr/hal-00654193

E. Agullo, C. Augonnet, and J. Dongarra, QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, 2011 IEEE International Parallel & Distributed Processing Symposium, pp.932-943, 2011.
DOI : 10.1109/IPDPS.2011.90

URL : https://hal.archives-ouvertes.fr/inria-00547614

C. Albing, Characterizing node orderings for improved performance, Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems, PMBS '15, pp.1-6, 2015.
DOI : 10.2172/800975

S. Ashby, P. Beckman, and J. Chen, Opportunities and Challenges of Exascale Computing, URL: https://science.energy.gov, Tech. rep. U.S. Department of Energy, 2010.

C. Augonnet, S. Thibault, and R. Namyst, Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures, Euro-Par Workshops. Lecture Notes in Computer Science, vol.6043, pp.56-65, 2009.
DOI : 10.1007/978-3-642-14122-5_9

URL : https://hal.archives-ouvertes.fr/inria-00421333

C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience 23, pp.187-198, 2011.
DOI : 10.1007/978-3-642-03869-3_80

URL : https://hal.archives-ouvertes.fr/inria-00384363

E. Bampis, F. Guinand, and D. Trystram, Some models for scheduling parallel programs with communication delays, Discrete Applied Mathematics, vol.72, issue.1-2, pp.5-24, 1997.
DOI : 10.1016/S0166-218X(96)00034-0

URL : https://doi.org/10.1016/s0166-218x(96)00034-0

A. Bhatele, K. Mohror, S. H. Langer, and K. E. Isaacs, There goes the neighborhood, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pp.41:1-41:12, 2013.
DOI : 10.1145/2503210.2503247

J. Błażewicz, K. H. Ecker, E. Pesch, G. Schmidt, and J. Węglarz, Handbook on Scheduling: From Theory to Applications. International Handbooks on Information Systems, 2007.

R. Bleuse, S. Kedad-Sidhoum, F. Monna, G. Mounié, and D. Trystram, Scheduling independent tasks on multi-cores with GPU accelerators, Concurrency and Computation: Practice and Experience, vol.27, issue.6, pp.1625-1638, 2015.
DOI : 10.1007/s00607-003-0011-9

URL : https://hal.archives-ouvertes.fr/hal-01081625

R. Bleuse, S. Hunold, and S. Kedad-Sidhoum, Scheduling Independent Moldable Tasks on Multi-Cores with GPUs, IEEE Transactions on Parallel and Distributed Systems, vol.28, issue.9, pp.2689-2702, 2017.
DOI : 10.1109/TPDS.2017.2675891

URL : https://hal.archives-ouvertes.fr/hal-01263100

G. Bosilca, A. Bouteiller, and A. Danalis, DAGuE: A generic distributed DAG engine for High Performance Computing, Parallel Computing, vol.38, issue.1-2, pp.37-51, 2012.
DOI : 10.1016/j.parco.2011.10.003

URL : http://www.netlib.org/lapack/lawnspdf/lawn231.pdf

M. Bougeret, P. Dutot, K. Jansen, C. Otte, and D. Trystram, A Fast 5/2-Approximation Algorithm for Hierarchical Scheduling, Euro-Par Lecture Notes in Computer Science, vol.17, issue.3, pp.157-167, 2010.
DOI : 10.1137/0217033

URL : https://hal.archives-ouvertes.fr/hal-00738518

A. Boukerche, J. M. Correa, A. C. M. A. de Melo, and R. P. Jacobi, A Hardware Accelerator for the Fast Retrieval of DIALIGN Biological Sequence Alignments in Linear Space, IEEE Transactions on Computers 59, pp.808-821, 2010.

Approximation Algorithms for Multiple Strip Packing and Scheduling Parallel Jobs in Platforms.

R. P. Brent, The Parallel Evaluation of General Arithmetic Expressions, Journal of the ACM, vol.21, issue.2, pp.201-206, 1974.
DOI : 10.1145/321812.321815

URL : http://cr.yp.to/bib/1974/brent.pdf

P. Brucker, Scheduling Algorithms. Fifth Edition, 2007.

J. Bueno, J. Planas, and A. Duran, Productive Programming of GPU Clusters with OmpSs, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp.557-568, 2012.
DOI : 10.1109/IPDPS.2012.58

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Computing, vol.35, issue.1, pp.38-53, 2009.
DOI : 10.1016/j.parco.2008.10.002

V. Bonifaci and A. Wiese, Scheduling Unrelated Machines of Few Different Types, URL: https, arXiv:1205.0974, 2012.

P. H. Carns, K. Harms, and W. E. Allcock, Understanding and Improving Computational Science Storage Access through Continuous Characterization, ACM Transactions on Storage, vol.7, issue.3, pp.8:1-8:26, 2011.

. Aragon, Considering Time in Designing Large-Scale Systems for Scientific Computing, pp.1533-1545, 2016.

E. G. Coffman Jr., M. R. Garey, D. S. Johnson, and R. E. Tarjan, Performance Bounds for Level-Oriented Two-Dimensional Packing Algorithms, SIAM Journal on Computing, vol.9, issue.4, pp.808-826, 1980.
DOI : 10.1137/0209062

D. E. Culler, R. M. Karp, and D. A. Patterson, LogP: Towards a Realistic Model of Parallel Computation, pp.1-12, 1993.

L. Chen, D. Ye, and G. Zhang, Online Scheduling on a CPU-GPU Cluster, TAMC. Lecture Notes in Computer Science, vol.7876, pp.1-9, 2013.
DOI : 10.1007/978-3-642-38236-9_1

M. Deveci, S. Rajamanickam, and V. J. Leung, Exploiting Geometric Partitioning in Task Mapping for Parallel Computers, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp.16-27, 2014.
DOI : 10.1109/IPDPS.2014.15

URL : http://bmi.osu.edu/hpc/papers/Deveci14-IPDPS.pdf

P. Dutot, G. Mounié, and D. Trystram, Scheduling Parallel Tasks: Approximation Algorithms, In: Handbook of Scheduling: Algorithms, Models, and Performance Analysis, Computer & Information Science Series. Chapman and Hall/CRC, 2004.

J. Dongarra, P. H. Beckman, and T. Moore, The International Exascale Software Project roadmap, The International Journal of High Performance Computing Applications, vol.25, issue.1, pp.3-60, 2011.
DOI : 10.1088/1742-6596/180/1/012045

URL : http://www.exascale.org/mediawiki/images/2/20/IESP-roadmap.pdf

M. Dorier, S. Ibrahim, G. Antoniu, and R. B. Ross, Using Formal Grammars to Predict I/O Behaviors in HPC: The Omnisc'IO Approach, IEEE Transactions on Parallel and Distributed Systems, vol.27, issue.8, pp.2435-2449, 2016.
DOI : 10.1109/TPDS.2015.2485980

URL : https://hal.archives-ouvertes.fr/hal-01238103

M. Drozdowski, Scheduling for Parallel Processing. Computer Communications and Networks, 2009.

A. C. Dusseau, D. E. Culler, K. E. Schauser, and R. P. Martin, Fast parallel sorting under LogP: experience with the CM-5, IEEE Transactions on Parallel and Distributed Systems, pp.791-805, 1996.
DOI : 10.1109/71.532111

URL : http://www.ece.eng.wayne.edu/~czxu/ece561/lnotes/logp-sort-paper.pdf

R. T. Evans, J. C. Browne, and W. L. Barth, Understanding Application and System Performance Through System-Wide Monitoring, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp.1702-1710, 2016.
DOI : 10.1109/IPDPSW.2016.145

J. Enos, G. H. Bauer, and R. Brunner, Topology-Aware Job Scheduling Strategies for Torus Networks, Cray User Group, URL: https://cug.org/proceedings, pp.74-77, 2014.

L. Eyraud, URL: http://graal.ens-lyon.fr/~leyraudd/These/manuscrit.pdf, 2006.

An effective approximation algorithm for the Malleable Parallel Task Scheduling problem, Journal of Parallel and Distributed Computing, vol.72, issue.5.

D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong, Theory and Practice in Parallel Job Scheduling, JSSPP. Lecture Notes in Computer Science, vol.1291, pp.1-34, 1997.

J. V. F. Lima, T. Gautier, N. Maillard, and V. Danjean, Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs, pp.75-82, 2012.

D. K. Friesen, Tighter Bounds for LPT Scheduling on Uniform Processors, SIAM Journal on Computing, vol.16, issue.3, pp.554-560, 1987.

S. Fortune and J. Wyllie, Parallelism in random access machines, Proceedings of the tenth annual ACM symposium on Theory of computing , STOC '78, pp.114-118, 1978.
DOI : 10.1145/800133.804339

URL : http://ecommons.cornell.edu/bitstream/1813/7454/1/78-334.pdf

A. Gainaru, G. Aupy, and A. Benoit, Scheduling the I/O of HPC Applications Under Congestion, 2015 IEEE International Parallel and Distributed Processing Symposium, pp.1013-1022, 2015.
DOI : 10.1109/IPDPS.2015.116

URL : https://hal.archives-ouvertes.fr/hal-01251938

T. Gautier, J. V. F. Lima, N. Maillard, and B. Raffin, XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures, pp.1299-1308.
DOI : 10.1109/IPDPS.2013.66

T. Gautier, X. Besseron et al., KAAPI: A thread scheduling runtime system for data flow computations on cluster of multiprocessors, 2007.

Topology-aware Resource Management for HPC Applications.

Y. Georgiou, Contributions for Resource and Job Management in High Performance Computing, URL: https, 2010.

J. Gergov, Algorithms for Compile-Time Memory Optimization, URL: https, SODA. ACM/SIAM, pp.907-908, 1999.

M. R. Garey and R. L. Graham, Bounds for Multiprocessor Scheduling with Resource Constraints, SIAM Journal on Computing, vol.4, issue.2, pp.187-200, 1975.

M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

R. L. Graham, E. L. Lawler, J. K. Lenstra, and A. H. G. Rinnooy Kan, Optimization and Approximation in Deterministic Sequencing and Scheduling: a Survey, Annals of Discrete Mathematics, vol.5, pp.287-326, 1979.
DOI : 10.1016/S0167-5060(08)70356-X

URL : https://ir.cwi.nl/pub/18052/18052A.pdf

E. Hermann, B. Raffin, F. Faure, T. Gautier, and J. Allard, Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations, Euro-Par Lecture Notes in Computer Science, vol.35, issue.3, pp.235-246, 2010.
DOI : 10.1007/s00224-002-1055-5

URL : https://hal.archives-ouvertes.fr/inria-00502448

D. Hilbert, Ueber die stetige Abbildung einer Linie auf ein Flächenstück, Mathematische Annalen, vol.38, issue.3, pp.459-460, 1891.
DOI : 10.1007/bf01199431

D. S. Hochbaum and D. B. Shmoys, Using Dual Approximation Algorithms for Scheduling Problems: Theoretical and Practical Results, Journal of the ACM, vol.34, issue.1, pp.144-162, 1987.

D. S. Hochbaum and D. B. Shmoys, A Polynomial Approximation Scheme for Scheduling on Uniform Processors: Using the Dual Approximation Approach, SIAM Journal on Computing, vol.17, issue.3, pp.539-551, 1988.

F. Isaila, J. Carretero, and R. B. Ross, CLARISSE: A Middleware for Data-Staging Coordination and Control on Large-Scale HPC Platforms, 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp.346-355, 2016.
DOI : 10.1109/CCGrid.2016.24

C. Imreh, Scheduling Problems on Two Sets of Identical Machines, Computing, vol.70, issue.4, pp.277-294, 2003.
DOI : 10.1007/s00607-003-0011-9

N. Jain, A. Bhatele, X. Ni, T. Gamblin, and L. V. Kalé, Partitioning Low-Diameter Networks to Eliminate Inter-Job Interference, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.19-439, 2017.
DOI : 10.1109/IPDPS.2017.91

K. Jansen and L. Porkolab, Linear-time Approximation Schemes for Scheduling Malleable Parallel Tasks, URL: https, pp.490-498, 1999.

G. Kathareios, C. Minkenberg, B. Prisacari, G. Rodríguez et al., Cost-Effective Diameter-Two Topologies: Analysis and Evaluation.

Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU.

V. J. Leung, E. M. Arkin, and M. A. Bender, Processor Allocation on Cplant: Achieving General Processor Locality Using One-Dimensional Allocation Strategies, pp.296-304, 2002.

J. K. Lenstra, D. B. Shmoys, and É. Tardos, Approximation algorithms for scheduling unrelated parallel machines, Mathematical Programming, vol.46, issue.1-3, pp.259-271, 1990.
DOI : 10.1007/BF01585745

W. Ludwig and P. Tiwari, Scheduling Malleable and Nonmalleable Parallel Tasks, SODA. ACM/SIAM, pp.167-176, 1994.
URL : https://dl.acm.org/citation.cfm?id=314464.314491

G. Lucarelli, F. Machado Mendonça, D. Trystram, and F. Wagner, Contiguity and Locality in Backfilling Scheduling.

F. Monna, Scheduling for new computing platforms with GPUs, URL: https, 2014.

G. M. Morton, A Computer Oriented Geodetic Data Base; and a New Technique in File Sequencing, URL: https, Tech. rep. IBM Ltd, p.72, 1966.

G. Mounié, C. Rapine, and D. Trystram, A 3/2-Approximation Algorithm for Scheduling Independent Monotonic Malleable Tasks, SIAM Journal on Computing, vol.37, issue.2, pp.401-412, 2007.

Solving very large instances of the scheduling of independent tasks problem on the GPU, Journal of Parallel and Distributed Computing, vol.73, issue.1.

J. A. Pascual, J. Miguel-Alonso, and J. A. Lozano, Application-aware metrics for partition selection in cube-shaped topologies, Parallel Computing, vol.40, issue.5, pp.129-139, 2014.

J. C. Phillips, J. E. Stone, and K. Schulten, Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters, DOI: 10.1145/1413370.1413379, pp.1-8, 2008.
DOI : 10.1109/sc.2008.5214716

URL : http://mc.stanford.edu/cgi-bin/images/8/8a/SC08_NAMD.pdf

G. Raravi and V. Nélis, A PTAS for Assigning Sporadic Tasks on Two-type Heterogeneous Multiprocessors.

F. Song, H. Ltaief, B. Hadri, and J. Dongarra, Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems.

D. B. Shmoys and É. Tardos, An approximation algorithm for the generalized assignment problem, pp.461-474, 1993.
DOI : 10.1007/BF01585178

F. Song, S. Tomov, and J. Dongarra, Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems.

A. Steinberg, A Strip-Packing Algorithm with Absolute Performance Bound 2.

E. V. Shchepin and N. Vakhania, An optimal rounding gives a better approximation for scheduling unrelated machines, Operations Research Letters, vol.33, issue.2, pp.127-133, 2005.

C. Stein and J. Wein, On the existence of schedules that are near-optimal for both makespan and total weighted completion time, Operations Research Letters, vol.21, issue.3, pp.115-122, 1997.
DOI : 10.1016/s0167-6377(97)00025-4

S. Tomov, J. Dongarra, and M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, vol.36, issue.5-6, pp.232-240, 2010.
DOI : 10.1016/j.parco.2009.12.005

URL : http://icl.cs.utk.edu/news_pub/submissions/tdb.pdf

F. Tessier, P. Malakar, V. Vishwanath, E. Jeannot, and F. Isaila, Topology-Aware Data Aggregation for Intensive I/O on Large-Scale Supercomputers, 2016 First International Workshop on Communication Optimizations in HPC (COMHPC), pp.73-81, 2016.
DOI : 10.1109/COMHPC.2016.013

URL : https://hal.archives-ouvertes.fr/hal-01394741

H. Topcuoglu, S. Hariri, and M. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems, vol.13, issue.3, pp.260-274, 2002.
DOI : 10.1109/71.993206

URL : http://meseec.ce.rit.edu/eecc722-fall2002/papers/hc/5/l0260.pdf

O. Tuncer, V. J. Leung, and A. K. Coskun, PaCMap, Proceedings of the 29th ACM on International Conference on Supercomputing, ICS '15, pp.37-46, 2015.
DOI : 10.1109/SC.2012.47

URL : http://dl.acm.org/ft_gateway.cfm?id=2751225&type=pdf

J. Turek, J. L. Wolf, and P. S. Yu, Approximate algorithms scheduling parallelizable tasks, Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures , SPAA '92, pp.323-332, 1992.
DOI : 10.1145/140901.141909

L. G. Valiant, A Bridging Model for Parallel Computation, Communications of the ACM, vol.33, issue.8, 1990.

A. Yarkhan, J. Kurzak, and J. Dongarra, QUARK Users' Guide: QUeueing And Runtime for Kernels. Tech. rep. ICL-UT-11-02, p.30, 2011.
