E. Agullo, B. Bramas, O. Coulaud, E. Darve, M. Messner et al., Task-based FMM for multicore architectures, SIAM Journal on Scientific Computing, vol.36, issue.1, pp.66-93, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00807368

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis et al., TensorFlow: A system for large-scale machine learning, Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), 2016.

E. Agullo, B. Bramas, O. Coulaud, E. Darve, M. Messner et al., Task-based FMM for heterogeneous architectures, Concurrency and Computation: Practice and Experience, vol.28, issue.9, 2016.
URL : https://hal.archives-ouvertes.fr/hal-00974674

E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. Mckenney et al., LAPACK: A Portable Linear Algebra Library for High-performance Computers, Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, Supercomputing '90, pp.2-11, 1990.

E. Agullo, H. Bouwmeester, J. Dongarra, J. Kurzak, J. Langou et al., Towards an efficient tile matrix inversion of symmetric positive definite matrices on multicore architectures, José M. Laginha M. Palma, Michel Daydé, Osni Marques, and João Correia Lopes, pp.129-138, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00548906

E. Agullo, A. Buttari, A. Guermouche, and F. Lopez, Multifrontal QR Factorization for Multicore Architectures over Runtime Systems, 19th International Conference Euro-Par, vol.8097, pp.521-532, 2013.

E. Agullo, A. Buttari, A. Guermouche, and F. Lopez, Implementing multifrontal sparse solvers for multicore architectures with Sequential Task Flow runtime systems, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01333645

E. Ayguadé, R. M. Badia, F. D. Igual, J. Labarta, R. Mayo et al., An Extension of the StarSs Programming Model for Platforms with Multiple GPUs, Proceedings of the 15th Euro-Par Conference, 2009.

E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin et al., The design of OpenMP tasks, IEEE Transactions on Parallel and Distributed Systems, vol.20, issue.3, pp.404-418, 2009.

E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak et al., Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects, In Journal of Physics: Conference Series, vol.180, p.12037, 2009.

D. Akhmetova, G. Kestor, R. Gioiosa, S. Markidis, and E. Laure, On the application task granularity and the interplay with the scheduling overhead in manycore shared memory systems, 2015 IEEE International Conference on Cluster Computing (CLUSTER), pp.428-437, 2015.

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par, vol.23, pp.187-198, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00384363

J. Barbosa, Gama framework: Hardware aware scheduling in heterogeneous environments, 2012.

G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier et al., DAGuE: A generic distributed DAG engine for high performance computing, 2010.

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, H. Haidar et al., Flexible development of dense linear algebra algorithms on massively parallel architectures with dplasma, Proceedings of the 25th IEEE International Symposium on Parallel & Distributed Processing Workshops and Phd Forum (IPDPSW'11), pp.1432-1441, 2011.

G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, and J. Dongarra, From serial loops to parallel execution on distributed systems, European Conference on Parallel Processing, pp.246-257, 2012.

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Hérault et al., PaRSEC: A programming paradigm exploiting heterogeneity for enhancing scalability, Computing in Science and Engineering, vol.15, issue.6, pp.36-45, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00930217

F. Broquedis, J. Clet-ortega, S. Moreaud, N. Furmento, B. Goglin et al., hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications, Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2010), pp.180-186, 2010.

J. Bueno, A. Duran, X. Martorell, E. Ayguadé, R. M. Badia et al., Poster: Programming clusters of GPUs with OmpSs, Proceedings of the international conference on Supercomputing, ICS '11, pp.378-378, 2011.

M. Bebendorf, Approximation of Boundary Element Matrices, Numerische Mathematik, vol.86, pp.565-589, 2000.

O. Beaumont, L. Eyraud-dubois, and Y. Gao, Influence of Tasks Duration Variability on Task-Based Runtime Schedulers, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01716489

S. Brinkmann, J. Gracia, C. Niethammer, and R. Keller, TEMANEJO-a debugger for task based parallel programming models, 2011.

R. Bleuse, S. Hunold, S. Kedad-sidhoum, and F. Monna, Scheduling Independent Moldable Tasks on Multi-Cores with GPUs, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01263100

R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall et al., Cilk: An Efficient Multithreaded Runtime System, SIGPLAN Not, vol.30, issue.8, pp.207-216, 1995.

P. Brucker and S. Knust, Complexity results for scheduling problems, 2009.

R. Bleuse, S. Kedad-sidhoum, F. Monna, G. Mounié, and D. Trystram, Scheduling Independent Tasks on Multi-cores with GPU Accelerators, Concurr. Comput. : Pract. Exper, vol.27, issue.6, pp.1625-1638, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01081625

R. D. Blumofe and C. E. Leiserson, Scheduling multithreaded computations by work stealing, Journal of the ACM (JACM), vol.46, issue.5, pp.720-748, 1999.

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, Parallel Computing, vol.35, issue.1, pp.38-53, 2009.

R. M. Badia, J. Labarta, R. Sirvent, J. M. Pérez, J. M. Cela et al., COMP Superscalar, an interoperable programming framework, SoftwareX, vol.1, pp.32-36, 2015.

P. Bellens, J. M. Pérez, F. Cabarcas, A. Ramírez, R. M. Badia et al., CellSs: Scheduling techniques to better exploit memory hierarchy, Scientific Programming, vol.17, pp.77-95, 2009.

J. Bueno, J. Planas, A. Duran, R. M. Badia, X. Martorell et al., Productive programming of GPU clusters with OmpSs, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp.557-568, 2012.

M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, Legion: expressing locality and independence with logical regions, Proceedings of the international conference on high performance computing, networking, storage and analysis, p.66, 2012.

B. L. Chamberlain, D. Callahan, and H. P. Zima, Parallel programmability and the chapel language, The International Journal of High Performance Computing Applications, vol.21, issue.3, pp.291-312, 2007.

J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov et al., ScaLAPACK: A portable linear algebra library for distributed memory computers-Design issues and performance, Applied Parallel Computing Computations in Physics, pp.95-106, 1996.

S. Collange, M. Daumas, D. Defour, and D. Parello, Barra: A parallel functional simulator for gpgpu, 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp.351-360, 2010.

K. Coulomb, M. Faverge, J. Jazeix, O. Lagrasse, A. Redondy, and C. Vuchener, ViTE's project page.

L. Clarke, I. Glendinning, and R. Hempel, The MPI message passing interface standard, Programming environments for massively parallel distributed systems, pp.213-218, 1994.

P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra et al., X10: An object-oriented approach to non-uniform cluster computing, SIGPLAN Not, vol.40, issue.10, pp.519-538, 2005.

L.-C. Canon and E. Jeannot, Evaluation and optimization of the robustness of DAG schedules in heterogeneous environments, IEEE Transactions on Parallel and Distributed Systems, vol.99, pp.532-546, 2009.

K. M. Chandy and L. Lamport, Distributed snapshots: Determining global states of distributed systems, ACM Trans. Comput. Syst, vol.3, issue.1, pp.63-75, 1985.

M. Cosnard and M. Loi, Automatic task graph generation techniques, Proceedings of the Twenty-Eighth Hawaii International Conference on, vol.2, pp.113-122, 1995.

H. Casanova, A. Legrand, and M. Quinson, SimGrid: a Generic Framework for Large-Scale Distributed Experiments, 10th IEEE International Conference on Computer Modeling and Simulation, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00260697

E. F. Codd, Multiprogram scheduling: Parts 1 and 2. introduction and theory, Commun. ACM, vol.3, issue.6, pp.347-350, 1960.

E. Chan, F. G. Van Zee, P. Bientinesi, E. S. Quintana-Ortí, G. Quintana-Ortí et al., SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, pp.123-132, 2008.

R. Dolbeau, S. Bihan, and F. Bodin, HMPP: A hybrid multi-core parallel programming environment, 2007.

A. Danalis, G. Bosilca, A. Bouteiller, T. Herault, and J. Dongarra, PTG: an abstraction for unhindered parallelism, Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), pp.21-30, 2014.

J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart, LINPACK Users' Guide, 1979.

J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin et al., Large Scale Distributed Deep Networks, in F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pp.1223-1231, 2012.

U. Dastgeer, J. Enmyren, and C. W. Kessler, Auto-tuning SkePU: a multi-backend skeleton programming framework for multi-GPU systems, Proceedings of the 4th international workshop on Multicore software engineering, IWMSE '11, pp.25-32, 2011.

J. B. Dennis, First version of a data flow procedure language, Programming Symposium, pp.362-376, 1974.

J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Commun. ACM, vol.51, issue.1, pp.107-113, 2008.

E. W. Dijkstra, Een algorithme ter voorkoming van de dodelijke omarming, 1965.

E. W. Dijkstra, The mathematics behind the banker's algorithm, Selected Writings on Computing: A Personal Perspective, 1982.

P. Dutot, G. Mounié, and D. Trystram, Scheduling Parallel Tasks: Approximation Algorithms, Handbook of Scheduling: Algorithms, Models, and Performance Analysis, vol.26, pp.26-27, 2004.
URL : https://hal.archives-ouvertes.fr/hal-00003126

V. Danjean, R. Namyst, and P. Wacrenier, An efficient multi-level trace toolkit for multi-threaded applications, EuroPar, 2005.
URL : https://hal.archives-ouvertes.fr/hal-00360309

J. Dongarra, Architecture-aware algorithms for scalable performance and resilience on heterogeneous architectures, vol.3, 2013.

A. Denis and F. Trahay, MPI Overlap: Benchmark and Analysis, International Conference on Parallel Processing, 45th International Conference on Parallel Processing, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01324179

J. Dongarra and D. Walker, Software libraries for linear algebra computations on high performance computers, SIAM Review, vol.37, issue.2, pp.151-180, 1995.

T. El-Ghazawi and F. Cantonnet, UPC performance and potential: A NPB experimental study, Supercomputing, ACM/IEEE 2002 Conference, pp.17-17, 2002.

J. Ellson, E. Gansner, L. Koutsofios, S. North, and G. Woodhull, Graphviz - Open Source Graph Drawing Tools, Lecture Notes in Computer Science, pp.483-484, 2001.

J. Enmyren and C. W. Kessler, SkePU: a multi-backend skeleton programming library for multi-GPU systems, Proceedings of the fourth international workshop on High-level parallel programming and applications, HLPP '10, pp.5-14, 2010.

M. Faverge and P. Ramet, A NUMA Aware Scheduler for a Parallel Sparse Direct Solver, Workshop on Massively Multiprocessor and Multicore Computers, page 5p, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00549827

F. Galilee, G. G. Cavalheiro, J. Roch, and M. Doreille, Athapascan-1: Online building data flow graph in a parallel language, Parallel Architectures and Compilation Techniques, pp.88-95, 1998.

J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn, FLAME: Formal linear algebra methods environment, ACM Transactions on Mathematical Software (TOMS), vol.27, issue.4, pp.422-455, 2001.

M. R. Garey and D. S. Johnson, Computers and Intractability, a Guide to the Theory of NP-Completeness, 1979.

T. Gautier, J. V. F. Lima, N. Maillard, and B. Raffin, XKaapi: A runtime system for data-flow task programming on heterogeneous architectures, Parallel & Distributed Processing (IPDPS), pp.1299-1308, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00799904

R. L. Graham, Bounds for certain multiprocessing anomalies, Bell System Technical Journal, vol.45, issue.9, pp.1563-1581, 1966.

T. Hoefler, J. Dinan, D. Buntinas, P. Balaji, B. Barrett et al., MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory, Computing, vol.95, issue.12, pp.1121-1136, 2013.

C. A. R. Hoare, Communicating sequential processes, Commun. ACM, vol.21, pp.666-677, 1978.

E. Hermann, B. Raffin, F. Faure, T. Gautier, and J. Allard, Multi-GPU and multi-CPU parallelization for interactive physics simulations, Euro-Par 2010-Parallel Processing, vol.6272, pp.235-246, 2010.

I. D. Mironescu and L. Vințan, Coloured Petri net modelling of task scheduling on a heterogeneous computational node, IEEE 10th International Conference on Intelligent Computer Communication and Processing (ICCP), pp.323-330, 2014.

Intel Corporation, Intel Math Kernel Library: Reference Manual, 2009.

A. Knüpfer, R. Brendel, H. Brunst, H. Mix, and W. E. Nagel, Introducing the Open Trace Format (OTF), Computational Science-ICCS 2006, pp.526-533, 2006.

H. Kaiser, M. Brodowicz, and T. Sterling, ParalleX: an advanced parallel execution model for scaling-impaired applications, 2009 International Conference on Parallel Processing Workshops, pp.394-401, 2009.

H. Kaiser, T. Heller, B. Adelstein-lelbach, A. Serio, and D. Fey, HPX: A task based programming model in a global address space, Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, vol.14, pp.1-6, 2014.

L. V. Kale and S. Krishnan, CHARM++: a portable concurrent object oriented system based on C++, Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications, OOPSLA '93, pp.91-108, 1993.

A. A. Khan, C. L. Mccreary, and M. S. Jones, A comparison of multiprocessor scheduling heuristics, Proceedings of the 1994 International Conference on Parallel Processing, vol.II, pp.243-250, 1994.

L. V. Kalé, B. Ramkumar, A. B. Sinha, and V. A. Saletore, The CHARM Parallel Programming Language and System: Part II-The Runtime system, 1994.

J. Kergommeaux, B. Stein, and M. Martin, Pajé: An extensible environment for visualizing multithreaded program executions, Proc. Euro-Par, vol.1900, pp.133-144, 2000.

L. Lamport, How to make a multiprocessor computer that correctly executes multiprocess programs, IEEE Trans. Comput, vol.28, pp.690-691, 1979.

O. S. Lawlor, Message passing for GPGPU clusters: CudaMPI, Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on, pp.1-8, 2009.

C. E. Leiserson, The Cilk++ concurrency platform, The Journal of Supercomputing, vol.51, pp.522-527, 2009.

X. Lacoste, M. Faverge, G. Bosilca, P. Ramet, and S. Thibault, Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes, Parallel & Distributed Processing Symposium Workshops (IPDPSW), pp.29-38, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00925017

C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Basic linear algebra subprograms for Fortran usage, ACM Transactions on Mathematical Software (TOMS), vol.5, issue.3, pp.308-323, 1979.

J. L. Lawall, G. Muller, and L. Barreto, Capturing OS expertise in an Event Type System: the Bossa experience, Proceedings of the 10th workshop on ACM SIGOPS European workshop, pp.54-61, 2002.

J. K. Lenstra, D. B. Shmoys, and É. Tardos, Approximation algorithms for scheduling unrelated parallel machines, Mathematical Programming, 1990.

L. Marchal, H. Nagy, B. Simon, and F. Vivien, Parallel scheduling of dags under memory constraints, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.204-213, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01828312

F. Monna, Scheduling for new computing platforms with GPUs, 2014.
URL : https://hal.archives-ouvertes.fr/tel-01127919

H. A. Mandviwala, U. Ramachandran, and K. Knobe, Capsules: Expressing composable computations in a parallel programming model, Languages and Compilers for Parallel Computing, vol.5234, pp.276-291, 2008.

OpenACC-Standard.org, The OpenACC Application Programming Interface, 2013.

J. Planas, R. M. Badia, E. Ayguadé, and J. Labarta, Hierarchical task-based programming with StarSs, International Journal of High Performance Computing Applications, vol.23, issue.3, pp.284-299, 2009.

A. Podobas, M. Brorsson, and K. Faxén, A comparison of some recent task-based parallel programming models, 3rd Workshop on Programmability Issues for Multi-Core Computers, 2010.

J. M. Pérez, R. M. Badia, and J. Labarta, A dependency-aware task-based programming environment for multi-core architectures, Proceedings of the 2008 IEEE International Conference on Cluster Computing, pp.142-151, 2008.

H. Pan, B. Hindman, and K. Asanović, Composing parallel software efficiently with Lithe, Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation, PLDI '10, pp.376-387, 2010.

A. Pereira, A. Onofre, and A. Proenca, Tuning pipelined scientific data analyses for efficient multicore execution, 2016 International Conference on High Performance Computing Simulation (HPCS), pp.751-758, 2016.

F. Pellegrini and J. Roman, Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs, High-Performance Computing and Networking, vol.1067, pp.493-498, 1996.

J. Reinders, Intel Threading Building Blocks, 2007.

M. Rocklin, Dask: Parallel computation with blocked algorithms and task scheduling, Proceedings of the 14th Python in Science Conference, pp.130-136, 2015.

M. C. Rinard, D. J. Scales, and M. S. Lam, Jade: A high-level, machine-independent language for parallel programming, Computer, vol.26, pp.28-38, 1993.

B. Simon, Scheduling task graphs on modern computing platforms, PhD thesis, 2018.
URL : https://hal.archives-ouvertes.fr/tel-01843558

E. Slaughter, W. Lee, S. Treichler, M. Bauer, and A. Aiken, Regent: A high-productivity programming language for HPC with logical regions, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p.81, 2015.

A. Sodani, Knights Landing (KNL): 2nd generation Intel Xeon Phi processor, 2015 IEEE Hot Chips 27 Symposium (HCS), pp.1-24, 2015.

J. L. Sobral and A. Proença, Dynamic grain-size adaptation on object oriented parallel programming: the SCOOPP approach, Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing, IPPS '99/SPDP '99, pp.728-732, 1999.

C. Simmendinger, M. Rahn, and D. Gruenewald, The GASPI API: A failure tolerant PGAS API for asynchronous dataflow on heterogeneous architectures, Sustained Simulation Performance, pp.17-32, 2014.

S. Blackford, The Two-dimensional Block-Cyclic Distribution, 1997.

F. Song, A. Yarkhan, and J. Dongarra, Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp.1-11, 2009.

C. Szyperski, Component technology: what, where, and how?, Proceedings of the 25th international conference on Software engineering, pp.684-693, 2003.

F. Trahay and A. Denis, A scalable and generic task scheduling system for communication libraries, IEEE International Conference on Cluster Computing, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00408521

S. Tomov, J. Dongarra, and M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, vol.36, issue.5-6, pp.232-240, 2010.

P. Thoman, K. Dichev, T. Heller, R. Iakymchuk, X. Aguilar et al., A taxonomy of task-based parallel programming technologies for high-performance computing, The Journal of Supercomputing, vol.74, issue.4, pp.1422-1434, 2018.

E. Tejedor, M. Farreras, D. Grove, M. Rosa, G. Badia et al., A high-productivity task-based programming model for clusters, Concurrency and Computation: Practice and Experience, vol.24, issue.18, pp.2421-2448, 2012.

H. Topcuoglu, S. Hariri, and M. Wu, Task scheduling algorithms for heterogeneous processors, Proceedings of the Eighth Heterogeneous Computing Workshop, HCW '99, vol.3, 1999.

M. Tillenius, Scientific Computing on Multicore Architectures, 2014.

M. Tillenius, SuperGlue: A shared memory framework using data versioning for dependency-aware task-based parallelization, SIAM Journal on Scientific Computing, vol.37, issue.6, pp.617-642, 2015.

S. Thibault, R. Namyst, and P. Wacrenier, Building portable thread schedulers for hierarchical multiprocessors: The BubbleSched framework, Euro-Par 2007 Parallel Processing, vol.4641, pp.42-51, Springer, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00154506

S. Tzeng, A. Patney, and J. D. Owens, Poster: Task management for irregular workloads on the GPU, Proceedings of NVIDIA GPU Technology Conference, 2010.

F. Trahay, De l'interaction des communications et de l'ordonnancement de threads au sein des grappes de machines multi-coeurs, PhD thesis, Université Bordeaux 1, 2009.
URL : https://hal.archives-ouvertes.fr/tel-00469488

R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, Multi2Sim: A simulation framework for CPU-GPU computing, Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pp.335-344, 2012.

J. D. Ullman, NP-complete scheduling problems, Journal of Computer and System sciences, vol.10, issue.3, pp.384-393, 1975.

L. G. Valiant, A bridging model for parallel computation, Commun. ACM, vol.33, issue.8, pp.103-111, 1990.

W. Wu, A. Bouteiller, G. Bosilca, M. Faverge, and J. Dongarra, Hierarchical DAG scheduling for Hybrid Distributed Systems, 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2015.
URL : https://hal.archives-ouvertes.fr/hal-01078359

A. Yarkhan, Dynamic task execution on shared and distributed memory architectures, 2012.

A. Yarkhan, J. Kurzak, and J. Dongarra, QUARK Users' Guide: QUeueing And Runtime for Kernels.

A. Yarkhan, J. Kurzak, P. Luszczek, and J. Dongarra, Porting the PLASMA numerical library to the OpenMP standard, International Journal of Parallel Programming, vol.45, issue.3, pp.612-633, 2017.

A. Zafari, TaskUniverse: A task-based unified interface for versatile parallel execution, Parallel Processing and Applied Mathematics, pp.169-184, 2018.

Y. Zheng, A. Kamil, B. Michael, H. Driscoll, K. Shan et al., UPC++: a PGAS Extension for C++, Parallel and Distributed Processing Symposium, pp.1105-1114, 2014.

A. Zafari and E. Larsson, Distributed dynamic load balancing for task parallel programming, 2018.

A. Zafari, E. Larsson, and M. Tillenius, DuctTeip: A task-based parallel programming framework for distributed memory architectures, 2016.

S. Thibault, Ordonnancement de processus légers sur architectures multiprocesseurs hiérarchiques : BubbleSched, une approche exploitant la structure du parallélisme des applications, PhD thesis, Université Bordeaux 1, 2007.

E. Agullo, O. Aumage, M. Faverge, N. Furmento, F. Pruvost et al., Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01618526

P. Arras, D. Fuin, E. Jeannot, A. Stoutchinin, and S. Thibault, List Scheduling in Embedded Systems Under Memory Constraints, International Journal of Parallel Programming, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00906117

V. G. Pinto, L. M. Schnorr, L. Stanisic, A. Legrand, S. Thibault et al., A Visual Performance Analysis Framework for Task-based Parallel Applications running on Hybrid Clusters, Concurrency and Computation: Practice and Experience, 2018.

L. Stanisic, S. Thibault, A. Legrand, B. Videau, and J. Méhaut, Faithful Performance Prediction of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures, Concurrency and Computation: Practice and Experience, p.16, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01147997

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst et al., A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs, GPU Computing Gems, vol.2, 2010.

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief et al., QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators, 25th IEEE International Parallel & Distributed Processing Symposium, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00547614

P. Arras, D. Fuin, E. Jeannot, A. Stoutchinin, and S. Thibault, List Scheduling in Embedded Systems under Memory Constraints, SBAC-PAD'2013-25th International Symposium on Computer Architecture and High-Performance Computing, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00906117

P. Arras, D. Fuin, E. Jeannot, and S. Thibault, DKPN: A Composite Dataflow/Kahn Process Networks Execution Model, 24th Euromicro International Conference on Parallel, Distributed and Network-based processing, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01234333

C. Augonnet, J. Clet-ortega, S. Thibault, and R. Namyst, Data-Aware Task Scheduling on Multi-Accelerator based Platforms, The 16th International Conference on Parallel and Distributed Systems (ICPADS), 2010.
URL : https://hal.archives-ouvertes.fr/inria-00523937

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Proceedings of the 15th International Euro-Par Conference, vol.5704, pp.863-874, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00384363

S. Benkner, E. Bajrovic, E. Marth, M. Sandrieser, R. Namyst et al., High-Level Support for Pipeline Parallelism on Many-Core Architectures, Euro-Par 2012 - International European Conference on Parallel and Distributed Computing, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00697020

F. Broquedis, J. Clet-ortega, S. Moreaud, N. Furmento, B. Goglin et al., hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications, Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2010), pp.180-186, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00429889

U. Dastgeer, C. Kessler, and S. Thibault, Flexible runtime support for efficient skeleton programming on hybrid systems, Proceedings of the International Conference on Parallel Computing (ParCo), Applications, Tools and Techniques on the Road to Exascale Computing, vol.22, pp.159-166, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00606200

C. Kessler, U. Dastgeer, S. Thibault, R. Namyst, A. Richards, S. Benkner, J. L. Träff, S. Pllana et al., Programmability and Performance Portability Aspects of Heterogeneous Multi-/Manycore Systems, Design, Automation and Test in Europe (DATE), 2012.

V. Martínez, D. Michéa, F. Dupros, O. Aumage, S. Thibault et al., Towards seismic wave modeling on heterogeneous many-core architectures using task-based runtime system, 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Florianopolis, 2015.

T. Odajima, T. Boku, M. Sato, T. Hanawa, Y. Kodama et al., Adaptive Task Size Control on High Level Programming for GPU/CPU Work Sharing, The 2013 International Symposium on Advances of Distributed and Parallel Computing (ADPC 2013), 2013.
URL : https://hal.archives-ouvertes.fr/hal-00920915

S. Ohshima, S. Katagiri, K. Nakajima, S. Thibault, and R. Namyst, Implementation of FEM Application on GPU with StarPU, SIAM CSE13 - SIAM Conference on Computational Science and Engineering, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00926144

L. Stanisic, S. Thibault, A. Legrand, B. Videau, and J. Méhaut, Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures, Euro-Par - 20th International Conference on Parallel Processing, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01011633

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst et al., Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators, Symposium on Application Accelerators in High Performance Computing (SAAHPC), 2010.
URL : https://hal.archives-ouvertes.fr/inria-00547616

E. Agullo, O. Aumage, M. Faverge, N. Furmento, F. Pruvost et al., Harnessing clusters of hybrid nodes with a sequential task-based programming model, 8th International Workshop on Parallel Matrix Algorithms and Applications, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01283949

E. Agullo, O. Beaumont, L. Eyraud-dubois, J. Herrmann, S. Kumar et al., Bridging the Gap between Performance and Bounds of Cholesky Factorization on Heterogeneous Platforms, Heterogeneity in Computing Workshop, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01120507

C. Augonnet, S. Thibault, and R. Namyst, Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures, Proceedings of the International Euro-Par Workshops 2009, HPPC'09, vol.6043, pp.56-65, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00421333

C. Augonnet, S. Thibault, R. Namyst, and M. Nijhuis, Exploiting the Cell/BE architecture with the StarPU unified runtime system, SAMOS Workshop-International Workshop on Systems, Architectures, Modeling, and Simulation, vol.5657, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00378705

V. G. Pinto, L. Stanisic, A. Legrand, L. M. Schnorr, S. Thibault et al., Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach, 3rd Workshop on Visual Performance Analysis (VPA), 2016.

X. Lacoste, M. Faverge, P. Ramet, S. Thibault, and G. Bosilca, Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes, HCW'2014 workshop of IPDPS, pp.8446-8446, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00925017

C. Rossignon, P. Hénon, O. Aumage, and S. Thibault, A NUMA-aware fine grain parallelization framework for multi-core architecture, PDSEC 2013 - 14th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00858350

M. Sergent, D. Goudin, S. Thibault, and O. Aumage, Controlling the Memory Subscription of Distributed Applications with a Task-Based Runtime System, 21st International Workshop on High-Level Parallel Programming Models and Supportive Environments, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01284004

S. Thibault, F. Broquedis, B. Goglin, R. Namyst, P. Wacrenier et al., An Efficient OpenMP Runtime System for Hierarchical Architectures, A Practical Programming Model for the Multi-Core Era, 3rd International Workshop on OpenMP, vol.4935, pp.161-172, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00154502

P. Virouleau, P. Brunet, F. Broquedis, N. Furmento, S. Thibault et al., Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite, 10th International Workshop on OpenMP, IWOMP2014, pp.16-29, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01081974

S. A. Mahmoudi, P. Manneback, C. Augonnet, and S. Thibault, Traitements d'images sur architectures parallèles et hétérogènes, 2012.

P. Arras, D. Fuin, E. Jeannot, A. Stoutchinin, and S. Thibault, Ordonnancement de liste dans les systèmes embarqués sous contrainte de mémoire, 21èmes Rencontres Francophones du Parallélisme (RenPar'21), 2013.

S. A. Mahmoudi, P. Manneback, C. Augonnet, and S. Thibault, Détection optimale des coins et contours dans des bases d'images volumineuses sur architectures multicoeurs hétérogènes, 20èmes Rencontres Francophones du Parallélisme (RenPar'20), 2011.

V. Garcia Pinto, L. M. Schnorr, A. Legrand, S. Thibault, L. Stanisic et al., Detecção de Anomalias de Desempenho em Aplicações de Alto Desempenho baseadas em Tarefas em Clusters Híbridos, 17o Workshop em Desempenho de Sistemas Computacionais e de Comunicação (WPerformance), 2018.

E. Agullo, B. Bramas, O. Coulaud, L. Stanisic, and S. Thibault, Modeling Irregular Kernels of Task-based codes: Illustration with the Fast Multipole Method, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01474556

E. Agullo, A. Buttari, M. Byckling, A. Guermouche, and I. Masliah, Achieving high-performance with a sparse direct solver on Intel KNL, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01473475

C. Augonnet, O. Aumage, N. Furmento, S. Thibault, and R. Namyst, StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00725477

C. Augonnet, S. Thibault, and R. Namyst, StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00467677

X. Lacoste, M. Faverge, P. Ramet, S. Thibault, and G. Bosilca, Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00925017

C. Augonnet, O. Aumage, N. Furmento, R. Namyst, and S. Thibault, StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators, LNCS. Springer, vol.7490, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00725477

P. Arras, Scheduling of dynamic streaming applications on hybrid embedded MPSoCs comprising programmable computing units and hardware accelerators, 2015.
URL : https://hal.archives-ouvertes.fr/tel-01159519

C. Augonnet, Scheduling Tasks over Multicore machines enhanced with Accelerators: a Runtime System's Perspective, 2011.

S. Kumar, Scheduling of Dense Linear Algebra Kernels on Heterogeneous Resources, 2017.
URL : https://hal.archives-ouvertes.fr/tel-01538516

C. Rossignon, A fine grain model programming for parallelization of sparse linear solver, 2015.
URL : https://hal.archives-ouvertes.fr/tel-01230876

M. Sergent, Scalability of a task-based runtime system for dense linear algebra applications, 2016.
URL : https://hal.archives-ouvertes.fr/tel-01483666

E. Agullo, O. Aumage, B. Bramas, O. Coulaud, and S. Pitoiset, Bridging the gap between OpenMP 4.0 and native runtime systems for the fast multipole method, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01372022

E. Agullo, O. Aumage, B. Bramas, O. Coulaud, and S. Pitoiset, Bridging the gap between OpenMP and task-based runtime systems for the fast multipole method, IEEE Transactions on Parallel and Distributed Systems, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01517153

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou et al., LU factorization for accelerator-based systems, 9th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 11), 2011.
URL : https://hal.archives-ouvertes.fr/hal-00654193

E. Agullo, B. Bramas, O. Coulaud, E. Darve, M. Messner et al., Task-based FMM for heterogeneous architectures, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00974674

E. Agullo, B. Bramas, O. Coulaud, E. Darve, M. Messner et al., Task-Based FMM for Multicore Architectures, SIAM Journal on Scientific Computing, vol.36, issue.1, pp.66-93, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00807368

E. Agullo, B. Bramas, O. Coulaud, M. Khannouz, and L. Stanisic, Task-based fast multipole method for clusters of multicore processors, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01387482

E. Agullo, O. Beaumont, L. Eyraud-Dubois, and S. Kumar, Are Static Schedules so Bad? A Case Study on Cholesky Factorization, Proceedings of the 30th IEEE International Parallel & Distributed Processing Symposium, IPDPS'16, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01223573

E. Agullo, L. Giraud, A. Guermouche, S. Nakov, and J. Roman, Task-based Conjugate Gradient: from multi-GPU towards heterogeneous architectures, Research Report, vol.8912, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01334734

C. Augonnet and R. Namyst, A unified runtime system for heterogeneous multicore architectures, Proceedings of the International Euro-Par Workshops 2008, HPPC'08, vol.5415, pp.174-183, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00326917

C. Augonnet, Vers des supports d'exécution capables d'exploiter les machines multicoeurs hétérogènes, 2008.

C. Augonnet, StarPU: un support exécutif unifié pour les architectures multicoeurs hétérogènes, 19èmes Rencontres Francophones du Parallélisme (RenPar'19), 2009.

O. Beaumont, T. Cojean, L. Eyraud-Dubois, A. Guermouche, and S. Kumar, Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources, International Conference on High Performance Computing, Data, and Analytics (HiPC), 2016.
URL : https://hal.archives-ouvertes.fr/hal-01361992

O. Beaumont, L. Eyraud-Dubois, and S. Kumar, Approximation proofs of a fast and efficient list scheduling algorithm for task-based runtime systems on multicores and GPUs, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.768-777, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01386174

C. Bordage, Ordonnancement dynamique, adapté aux architectures hétérogènes, de la méthode multipôle pour les équations de Maxwell, en électromagnétisme, Université Bordeaux 1, 2013.

S. Benkner, S. Pllana, J. L. Träff, P. Tsigas, U. Dolinsky et al., PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems, IEEE Micro, vol.31, issue.5, pp.28-41, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00648480

J. M. Couteyen Carpaye, J. Roman, and P. Brenner, Design and Analysis of a Task-based Parallelization over a Runtime System of an Explicit Finite-Volume CFD Code with Adaptive Time Stepping, International Journal of Computational Science and Engineering, pp.1-22, 2017.

T. Cojean, A. Guermouche, A. Hugo, R. Namyst, and P. Wacrenier, Resource aggregation for task-based Cholesky Factorization on top of heterogeneous machines, HeteroPar'2016 workshop of Euro-Par, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01181135

A. Chevalier, Critical resources management and scheduling under StarPU, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01718280

T. Cojean, Programmation of heterogeneous architectures using moldable tasks, 2018.
URL : https://hal.archives-ouvertes.fr/tel-01816341

L. Courtès, C Language Extensions for Hybrid CPU/GPU Programming with StarPU, 2013.

S. Henry, A. Denis, and D. Barthou, Programmation unifiée multi-accélérateur OpenCL, pp.1233-1249, 2012.

S. Henry, A. Denis, D. Barthou, M.-C. Counilh, and R. Namyst, Toward OpenCL Automatic Multi-Device Support, 2014.

S. Henry, Programmation multi-accélérateurs unifiée en OpenCL, 20èmes Rencontres Francophones du Parallélisme (RenPar'20), 2011.

S. Henry, Modèles de programmation et supports exécutifs pour architectures hétérogènes, 2013.

S. Henry, ViperVM: a Runtime System for Parallel Functional High-Performance Computing on Heterogeneous Architectures, 2nd Workshop on Functional High-Performance Computing (FHPC'13), 2013.
URL : https://hal.archives-ouvertes.fr/hal-00851122

A. Hugo, A. Guermouche, R. Namyst, and P. Wacrenier, Composing multiple StarPU applications over heterogeneous machines: a supervised approach, Third International Workshop on Accelerators and Hybrid Exascale Systems, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00824514

A. Hugo, Composabilité de codes parallèles sur architectures hétérogènes. Mémoire de master, 2011.

A. Hugo, Le problème de la composition parallèle : une approche supervisée, 21èmes Rencontres Francophones du Parallélisme (RenPar'21), 2013.

A. Hugo, Composability of parallel codes on heterogeneous architectures, 2014.
URL : https://hal.archives-ouvertes.fr/tel-01162975

J. Janzén, D. Black-Schaffer, and A. Hugo, Partitioning GPUs for Improved Scalability, IEEE 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2016.

M. Sergent and S. Archipoff, Modulariser les ordonnanceurs de tâches : une approche structurelle, Conférence d'informatique en Parallélisme, Architecture et Système (ComPAS'2014), 2014.

L. Stanisic, E. Agullo, A. Buttari, A. Guermouche, A. Legrand et al., Fast and Accurate Simulation of Multithreaded Sparse Linear Algebra Solvers, The 21st IEEE International Conference on Parallel and Distributed Systems, 2015.
DOI : 10.1109/icpads.2015.67
URL : https://hal.archives-ouvertes.fr/hal-01180272