Related works

Towards reproducible data analysis

2.3 Reproduce analysis with the same inputs (with variations) (R1)

Reproduce analysis on other inputs (with variation) (R2)

Good practices for reproducible experiment

6.3.1 P1: Use a long-term, publicly available, properly organized, version control repository

P2: Use reusable open source software and proven simulators

6.3.4 P4: Provide experiment design and workflow

P5: Provide inputs and results

Reproducibility

Tools

Conclusion

6.1.3 Reproducible Software Environments with Nix

G. Aupy, O. Beaumont, and L. Eyraud-Dubois, Sizing and Partitioning Strategies for Burst-Buffers to Reduce IO Contention, Inria, p.29, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02141616

T. E. Anderson, D. E. Culler, and D. Patterson, A case for NOW (Networks of Workstations), IEEE Micro, vol.15, issue.1, p.7, 1995.

S. R. Alam, H. N. El-Harake, K. Howard, N. Stringfellow, and F. Verzelloni, Parallel I/O and the Metadata Wall, Proceedings of the Sixth Workshop on Parallel Data Storage, PDSW '11, p.29, 2011.

G. Amvrosiadis, J. Park, and G. R. Ganger, On the diversity of cluster workloads and its impact on research results, 2018 USENIX Annual Technical Conference (USENIX ATC 18), p.11, 2018.

R. Ananthanarayanan, K. Gupta, and P. Pandey, Cloud Analytics: Do We Really Need to Reinvent the Storage Stack?, In: Proceedings of the 2009 Conference on Hot Topics in Cloud Computing. HotCloud'09, p.29, 2009.

N. Andrade, W. Cirne, F. Vilar-brasileiro, and P. Roisenberg, OurGrid: An Approach to Easily Assemble Grids with Equitable Resource Sharing, Job Scheduling Strategies for Parallel Processing, 9th International Workshop, p.69, 2003.

D. P. Anderson, BOINC: a system for public-resource computing and storage, Fifth IEEE/ACM International Workshop on Grid Computing, p.69, 2004.

M. Asch, T. Moore, and R. Badia, Big data and extreme-scale computing: Pathways to Convergence - Toward a shaping strategy for a future software and data ecosystem for scientific inquiry, The International Journal of High Performance Computing Applications, vol.32, pp.435-479, 2018.

G. Avelino, L. T. Passos, A. C. Hora, and M. Valente, A Novel Approach for Estimating Truck Factors, p.128, 2016.

A. Azab, Enabling docker containers for high-performance and many-task computing, 2017 IEEE International Conference on, p.50, 2017.

D. Balouek, A. Carpen-amarie, and G. Charrier, Adding Virtualization Capabilities to the Grid'5000 Testbed, Cloud Computing and Services Science, vol.367, pp.3-20, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00946971

J. Bhimani, Z. Yang, M. Leeser, and N. Mi, Accelerating big data applications using lightweight virtualization framework on enterprise cloud, High Performance Extreme Computing Conference (HPEC), p.33, 2017.

M. S. Birrittella, M. Debbage, and R. Huggahalli, Intel® Omni-path Architecture: Enabling Scalable, High Performance Fabrics, 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, p.22, 2015.

C. Boettiger, An introduction to Docker for reproducible research, ACM SIGOPS Operating Systems Review, vol.49, p.133, 2015.

B. L. Buzbee, W. J. Worlton, G. Michael, and G. Rodrigue, DOE research in utilization of high-performance computers, 1980.

B. Bzeznik, O. Henriot, V. Reis, O. Richard, and L. Tavard, Nix as HPC package management system, Proceedings of the Fourth International Workshop on HPC User Support Tools, p.58, 2017.

Y. Chen, S. Alspaugh, and R. Katz, Interactive analytical processing in big data systems: A cross-industry study of mapreduce workloads, Proceedings of the VLDB Endowment, vol.5, p.12, 2012.

N. Capit, G. D. Costa, and Y. Georgiou, A batch scheduler with high level components, CCGrid 2005, p.68, 2005.
URL : https://hal.archives-ouvertes.fr/hal-00005106

P. Carbone, A. Katsifodimos, and S. Ewen, Apache Flink™: Stream and Batch Processing in a Single Engine, IEEE Data Eng. Bull., vol.38, p.68, 2015.

H. Casanova, A. Giersch, A. Legrand, M. Quinson, and F. Suter, Versatile, scalable, and accurate simulation of distributed applications and platforms, Journal of Parallel and Distributed Computing, vol.74, pp.2899-2917, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01017319

H. Casanova, S. Pandey, and J. Oeth, WRENCH: A Framework for Simulating Workflow Management Systems, WORKS 2018 -13th Workshop on Workflows in Support of Large-Scale Science. Dallas, United States, p.93, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01948162

M. Cox and D. Ellsworth, Application-controlled Demand Paging for Out-of-core Visualization, Proceedings of the 8th Conference on Visualization '97. VIS '97, vol.8, p.235, 1997.

N. Chaimov, A. Malony, and S. Canon, Scaling Spark on HPC systems, Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pp.97-110, 2016.

S. J. Chapin, W. Cirne, and D. G. Feitelson, Benchmarks and Standards for the Evaluation of Parallel Job Schedulers, Job Scheduling Strategies for Parallel Processing, p.157, 1999.

F. Chirigati, R. Rampin, D. Shasha, and J. Freire, ReproZip: Computational Reproducibility With Ease, Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, p.133, 2016.

L. Courtès, Functional Package Management with Guix, p.57, 2013.

C. Collberg, T. Proebsting, and A. M. Warren, Repeatability and benefaction in computer systems research, University of Arizona, 2015.

L. Courtès and R. Wurmus, Reproducible and User-Controlled Software Environments in HPC with Guix, Euro-Par 2015: Parallel Processing Workshops, Ed. by Sascha Hunold, Alexandru Costan, Domingo Giménez et al., p.131, 2015.

L. Courtès and R. Wurmus, Reproducible and user-controlled software environments in HPC with Guix, European Conference on Parallel Processing, p.58, 2015.

A. Devresse, F. Delalondre, and F. Schürmann, Nix Based Fully Automated Workflows and Ecosystem to Guarantee Scientific Result Reproducibility Across Software Environments and Systems, Proceedings of the 3rd International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering. SE-HPCCSE '15, p.58, 2015.

A. Devresse, F. Delalondre, and F. Schürmann, Nix based fully automated workflows and ecosystem to guarantee scientific result reproducibility across software environments and systems, Proceedings of the 3rd International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering, p.133, 2015.

E. Dolstra, M. de Jonge, and E. Visser, Nix: A Safe and Policy-Free System for Software Deployment, LISA, vol.4, p.56, 2004.

A. Degomme, A. Legrand, and G. Markomanolis, Simulating MPI applications: the SMPI approach, IEEE Transactions on Parallel and Distributed Systems, vol.28, p.92, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01415484

S. Derradji, T. Palfer-Sollier, J. Panziera, A. Poudes, and F. Wellenreiter, The BXI Interconnect Architecture, 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, p.22, 2015.

J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI'04, pp.10-10, 2004.

J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM, vol.51, pp.107-113, 2008.

S. Di, D. Kondo, and W. Cirne, Characterization and Comparison of Cloud versus Grid Workloads, 2012 IEEE International Conference on Cluster Computing, p.11, 2012.

J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart, LINPACK Users' Guide. Other Titles in Applied Mathematics. Society for Industrial and Applied Mathematics, issue.6, 1979.

C. Dünner, T. P. Parnell, and K. Atasu, High-Performance Distributed Machine Learning using Apache SPARK, p.19, 2016.

P. Dutot, M. Mercier, M. Poquet, and O. Richard, Batsim: a Realistic Language-Independent Resources and Jobs Management Systems Simulator, 20th Workshop on Job Scheduling Strategies for Parallel Processing, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01333471

Z. Fadika, M. Govindaraju, R. Canon, and L. Ramakrishnan, Evaluating Hadoop for Data-Intensive Scientific Operations, 2012 IEEE Fifth International Conference on Cloud Computing(CLOUD), vol.00, p.30, 2012.

K. Fujiwara and H. Casanova, Speed and accuracy of network simulation in the SimGrid framework, Proceedings of the 2nd international conference on Performance evaluation methodologies and tools, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), vol.154, p.94, 2007.

D. Feitelson, Workload Modeling for Computer Systems Performance Evaluation, 2015.

D. G. Feitelson, From repeatability to reproducibility and corroboration, ACM SIGOPS Operating Systems Review, vol.49, p.128, 2015.

D. G. Feitelson, Resampling with Feedback - A New Paradigm of Using Workload Data for Performance Evaluation, Proceedings of the 22nd International Conference on Euro-Par 2016: Parallel Processing, vol.9833, p.167, 2016.

W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, An updated performance comparison of virtual machines and Linux containers, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), vol.33, pp.44-46, 2015.

J. L. Furlani and P. W. Osel, Abstract Yourself With Modules, Proceedings of the 10th USENIX Conference on System Administration, LISA '96, p.53, 1996.

D. G. Feitelson and L. Rudolph, Toward convergence in job schedulers for parallel supercomputers, Job Scheduling Strategies for Parallel Processing, vol.68, p.63, 1996.

N. Gaffney, C. Jordan, T. Minyard, and D. Stanzione, Building Wrangler: A transformational data intensive resource for the open science community, 2014 IEEE International Conference on Big Data (Big Data), p.23, 2014.

N. Gaffney, C. Jordan, T. Minyard, and D. Stanzione, Building wrangler: A transformational data intensive resource for the open science community, Big Data (Big Data), p.63, 2014.

T. Gamblin, M. Legendre, and M. R. Collette, The Spack Package Manager: Bringing Order to HPC Software Chaos, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '15, vol.40, p.54, 2015.

W. Gao, L. Wang, and J. Zhan, Big Data Dwarfs: Towards Fully Understanding Big Data Analytics Workloads, p.13, 2018.

Y. Georgiou, Contributions for Resource and Job Management in High Performance Computing, p.24, 2010.
URL : https://hal.archives-ouvertes.fr/tel-01499598

S. Ghemawat, H. Gobioff, and S. Leung, The Google File System, Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. SOSP '03, pp.29-43, 2003.

M. Geimer, K. Hoste, and R. Mclay, Modern Scientific Software Management Using EasyBuild and Lmod, 2014 First International Workshop on HPC User Support Tools, p.54, 2014.

A. Gittens, A. Devarakonda, and E. Racah, Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies, p.18, 2016.

C. Galleguillos, Z. Kiziltan, and A. Netti, AccaSim: An HPC Simulator for Workload Management, High Performance Computing. Ed. by Esteban Mocskos and Sergio Nesmachnow, p.92, 2018.

M. Geier, L. Nussbaum, and M. Quinson, In: WATERS -4th International Workshop on Analysis Tools and Methodologies for Embedded and Real-time Systems, p.154, 2013.

I. Gog, M. Schwarzkopf, A. Gleave, R. N. M. Watson, and S. Hand, Firmament: fast, centralized cluster scheduling at scale, 12th USENIX Symposium on Operating Systems Design and Implementation, pp.99-115, 2016.

J. Gomes, I. C. Plasencia, and E. Bagnaschi, Enabling rootless Linux Containers in multi-user environments: the udocker tool, p.33, 2017.

J. Gomes, I. C. Plasencia, and E. Bagnaschi, Enabling rootless Linux Containers in multi-user environments: the udocker tool, vol.49, p.48, 2017.

Y. Georgiou, O. Richard, and N. Capit, Evaluations of the Lightweight Grid CIGRI upon the Grid5000 Platform, Third International Conference on e-Science and Grid Computing, p.69, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00687520

J. Gehr and J. Schneider, Measuring Fragmentation of Two-Dimensional Resources Applied to Advance Reservation Grid Scheduling, vol.87, pp.276-283, 2009.

J. E. Hannay, C. Macleod, and J. Singer, How do scientists develop and use scientific software, In: 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering, p.125, 2009.

J. Howison and J. Bullard, Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature, Journal of the Association for Information Science and Technology, vol.67, p.125, 2016.

M. Herschel, R. Diestelkämper, and H. Ben-lahmar, A Survey on Provenance: What for? What Form? What from?, In: The VLDB Journal, vol.26, issue.6, p.146, 2017.

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, p.14, 2015.

F. C. Heinrich, A. Carpen-Amarie, and A. Degomme, Predicting the Performance and the Power Consumption of MPI Applications With SimGrid, working paper or preprint, p.121, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01446134

F. C. Heinrich, T. Cornebize, and A. Degomme, Predicting the Energy Consumption of MPI Applications at Scale Using a Single Node, IEEE. Hawaii, p.121, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01523608

S. Herbein, D. H. Ahn, and D. Lipari, Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters, Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing. HPDC '16, vol.106, p.92, 2016.

H. Pham-Tuan, Hiep_PHAM_report_DFS_impact_on_HPC_app_REPORT.pdf, MA thesis, Univ. Grenoble Alpes, vol.76, p.31, 2018.

B. Hindman, A. Konwinski, and M. Zaharia, Mesos: A Platform for Fine-grained Resource Sharing in the Data Center, Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation. NSDI'11, pp.295-308, 2011.

M. Imbert, L. Pouilloux, J. Rouzaud-cornabas, A. Lèbre, and T. Hirofuchi, Using the EXECO toolbox to perform automatic and reproducible cloud experiments, 1st International Workshop on UsiNg and building ClOud Testbeds (UNICO), collocated with IEEE CloudCom, vol.149, p.77, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00861886

N. S. Islam, X. Lu, M. Wasi-ur-rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, p.30, 2015.

D. M. Jacobsen and R. S. Canon, Contain this, unleashing Docker for HPC, Proceedings of the Cray User Group, vol.49, p.33, 2015.

I. Jimenez, M. Sevilla, and N. Watkins, The popper convention: Making reproducible systems evaluation practical, Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp.1561-1570, 2017.

S. N. Kashyap, A. J. Fewings, and J. Davies, Big Data at HPC Wales, vol.74, p.71, 2015.

J. Kepner, W. Arcand, and D. Bestor, Lustre, Hadoop, Accumulo, p.28, 2015.

R. H. Katz, G. A. Gibson, and D. A. Patterson, Disk system architectures for high performance computing, Proceedings of the IEEE 77, vol.12, p.7, 1989.

D. E. Knuth, Literate Programming, The Computer Journal, vol.27, p.138, 1984.

G. M. Kurtzer, V. Sochat, and M. W. Bauer, Singularity: Scientific containers for mobility of compute, PLOS ONE, vol.12, issue.5, p.47, 2017.

G. M. Kurtzer, V. Sochat, and M. W. Bauer, Singularity: Scientific containers for mobility of compute, PLOS ONE, vol.12, p.33, 2017.

S. Krishnan, M. Tatineni, and C. Baru, myHadoop - Hadoop-on-Demand on traditional HPC resources, San Diego Supercomputer Center, vol.70, p.9, 2011.

D. Klusáček, Š. Tóth, and G. Podolníková, Complex Job Scheduling Simulations with Alea 4, Proceedings of the 9th EAI International Conference on Simulation Tools and Techniques, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), pp.124-129, 2016.

S. Lawrence, D. M. Pennock, and G. W. Flake, Persistence of web references in scientific research, Computer, vol.2, p.134, 2001.

H. Z. Lo and J. P. Cohen, Academic Torrents: Scalable Data Distribution, p.147, 2016.

N. Liu, J. Cope, and P. Carns, On the Role of Burst Buffers in Leadership-class Storage Systems, vol.96, p.93, 2012.

N. Liu, X. Yang, X. Sun, J. Jenkins, and R. Ross, YARNsim: Simulating Hadoop YARN, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, p.93, 2015.

M. Livny, J. Basney, R. Raman, and T. Tannenbaum, Mechanisms for high throughput computing, SPEEDUP journal, vol.11, pp.36-40, 1997.

A. Luckow, I. Paraskevakos, G. Chantzialexiou, and S. Jha, Hadoop on HPC: Integrating Hadoop and Pilot-based Dynamic Resource Management, arXiv preprint, 2016.

K. Ma, In Situ Visualization at Extreme Scale: Challenges and Opportunities, vol.29, p.16, 2009.

O. Marcu, A. Costan, G. Antoniu, and M. Pérez, Spark Versus Flink: Understanding Performance in Big Data Analytics Frameworks, pp.433-442, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01347638

. Bibliography,

R. Mclay, K. W. Schulz, W. L. Barth, and T. Minyard, Best Practices for the Deployment and Management of Production HPC Clusters, State of the Practice Reports. SC '11, vol.9, p.53, 2011.

J. Mambretti, J. Chen, and F. Yeh, Next Generation Clouds, the Chameleon Cloud Testbed, and Software Defined Networking (SDN), Proceedings of the 2015 International Conference on Cloud Computing Research and Innovation (ICCCRI). ICCCRI '15, pp.73-79, 2015.

M. Mercier, D. Glesser, Y. Georgiou, and O. Richard, Big Data and HPC collocation: Using HPC idle resources for Big Data Analytics, p.62, 2017.

M. Mercier, A. Faure, and O. Richard, Considering the Development Workflow to Achieve Reproducibility with Variation, SC 2018 -Workshop: ResCuE-HPC. Dallas, United States, p.127, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01891084

Z. Ming, C. Luo, and W. Gao, BDGS: A scalable big data generator suite in big data benchmarking, Workshop on Big Data Benchmarks, p.79, 2013.

E. Molina-estolano, M. Gokhale, and C. Maltzahn, Mixing Hadoop and HPC Workloads on Parallel Filesystems, Proceedings of the 4th Annual Workshop on Petascale Data Storage. PDSW '09, p.29, 2009.

W. C. Moody, L. B. Ngo, E. Duffy, and A. Apon, JUMMP: Job Uninterrupted Maneuverable MapReduce Platform, 2013 IEEE International Conference on Cluster Computing (CLUSTER), p.10, 2013.

, Evaluating the Suitability of Commercial Clouds for NASA's High Performance Computing Applications: A Trade Study, p.33, 2018.

S. G. Nash, A History of Scientific Computing, 1990.

M. Veiga-neves, T. Ferreto, and C. Rose, Scheduling mapreduce jobs in hpc clusters, Euro-Par 2012 Parallel Processing, p.71, 2012.

B. Nitzberg, J. M. Schopf, and J. Jones, PBS Pro: Grid computing and scheduling attributes, Grid resource management, p.68, 2004.

D. Nüst, C. Granell, and B. Hofer, Reproducible research and GIScience: an evaluation using AGILE conference papers, p.136, 2018.

K. Ousterhout, R. Rasti, and S. Ratnasamy, Making Sense of Performance in Data Analytics Frameworks, In: NSDI, vol.15, p.19, 2015.

S. J. Plimpton and K. D. Devine, MapReduce in MPI for Large-scale Graph Algorithms, Parallel Comput., vol.37, issue.9, p.69, 2011.

B. Peng, B. Zhang, and L. Chen, HarpLDA+: Optimizing latent dirichlet allocation for parallel efficiency, 2017 IEEE International Conference on, p.20, 2017.

J. A. Pascual, J. Navaridas, and J. Miguel-Alonso, Effects of Topology-Aware Allocation Policies on Scheduling Performance, Job Scheduling Strategies for Parallel Processing, Ed. by Eitan Frachtenberg and Uwe Schwiegelshohn, p.92, 2009.

K. R. Popper, The Logic of Scientific Discovery, Routledge, p.126, 1959.

R. Priedhorsky and T. Randles, Charliecloud: Unprivileged Containers for User-defined Software Stacks in HPC, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '17, vol.36, p.48, 2017.

R. Priedhorsky and T. Randles, Charliecloud: Unprivileged containers for user-defined software stacks in hpc, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p.33, 2017.

J. Rattner, Concurrent processing: A new direction in scientific computing, Proc. AFIPS Conf, vol.54, p.7, 1985.

R. Raghavendra, P. Dewan, and M. Srivatsa, Unifying HDFS and GPFS: Enabling Analytics on Software-Defined Storage, Proceedings of the 17th International Middleware Conference. Middleware '16, vol.3, p.30, 2016.

K. Ren, Y. Kwon, M. Balazinska, and B. Howe, Hadoop's Adolescence: An Analysis of Hadoop Usage in Scientific Workloads, Proc. VLDB Endow, vol.6, p.13, 2013.

I. Raicu, I. T. Foster, and P. Beckman, Making a Case for Distributed File Systems at Exascale, Proceedings of the Third International Workshop on Large-scale System and Application Performance. LSAP '11, p.31, 2011.

R. Ricci, G. Wong, and L. Stoller, Apt: A platform for repeatable research in computer science, ACM SIGOPS Operating Systems Review, vol.49, p.132, 2015.

C. Ruiz, E. Jeanvoine, and L. Nussbaum, Performance evaluation of containers for HPC, VHPC - 10th Workshop on Virtualization in High-Performance Cloud Computing, p.33, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01195549

J. L. Reyes-ortiz, L. Oneto, and D. Anguita, Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf, vol.53, pp.121-130, 2015.

J. L. Reyes-ortiz, L. Oneto, and D. Anguita, Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf, p.69, 2015.

J. Renker, S. Schlagkamp, and G. Rinkenauer, Questionnaire for User Habits of Compute Clusters (QUHCC), p.167, 2015.

C. Ruiz, S. Harrache, M. Mercier, and O. Richard, Reconstructable Software Appliances with Kameleon, Operating Systems Review, vol.49, pp.80-89, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01334135

M. S. Al-kahtani, Big Data Networking : Requirements, Architecture and Issues, International Journal of Wireless & Mobile Networks, vol.8, p.32, 2016.

M. Schwarzkopf, A. Konwinski, M. Abd-el-malek, and J. Wilkes, Omega: flexible, scalable schedulers for large compute clusters, SIGOPS European Conference on Computer Systems (EuroSys), p.68, 2013.

M. Sergent, M. Dagrada, and P. Carribault, Efficient Communication/Computation Overlap with MPI+OpenMP Runtimes Collaboration, Parallel Processing -24th International Conference on Parallel and Distributed Computing, p.166, 2018.

L. Stanisic, A. Legrand, and V. Danjean, An Effective Git And Org-Mode Based Workflow For Reproducible Research, SIGOPS Oper. Syst. Rev, vol.49, issue.1, p.142, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01112795

A. Sulistio, U. Cibej, S. Venugopal, B. Robic, and R. Buyya, A toolkit for modelling and simulating data Grids: an extension to GridSim, Concurrency and Computation: Practice and Experience, vol.20, p.154, 2008.

S. Sur, H. Wang, J. Huang, X. Ouyang, and D. Panda, Can high-performance interconnects benefit Hadoop distributed file system, Workshop on Micro Architectural Support for Virtualization, Data Center Computing, and Clouds (MASVDC), held in conjunction with MICRO, Citeseer, p.10, 2010.

W. Tantisiriroj, S. W. Son, and S. Patil, On the duality of data-intensive file system design: Reconciling HDFS and PVFS, SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, p.30, 2011.

E. Totoni, T. A. Anderson, and T. Shpeisman, HPAT: high performance analytics with scripting ease-of-use, Proceedings of the International Conference on Supercomputing, p.20, 2017.

P. Troger, H. Rajic, A. Haas, and P. Domagalski, Standardization of an API for Distributed Resource Management Systems, Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07), p.166, 2007.

A. Uselton, M. Howison, and N. J. Wright, Parallel I/O performance: From events to ensembles, Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pp.1-11, 2010.

V. K. Vavilapalli, A. C. Murthy, and C. Douglas, Apache Hadoop YARN: Yet Another Resource Negotiator, Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC '13, vol.5, p.68, 2013.

V. K. Vavilapalli, A. C. Murthy, and C. Douglas, Apache Hadoop YARN: Yet another resource negotiator, Proceedings of the 4th Annual Symposium on Cloud Computing, vol.68, p.9, 2013.

P. Velho, L. Schnorr, H. Casanova, and A. Legrand, On the Validity of Flow-level TCP Network Models for Grid and Cloud Simulations, ACM Transactions on Modeling and Computer Simulation, vol.23, p.93, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00872476

M. Vilayannur, S. Lang, R. Ross, R. Klundt, and L. Ward, Extending the POSIX I/O interface: A parallel file system perspective, p.29, 2008.

L. Wang, J. Zhan, and C. Luo, Bigdatabench: A big data benchmark suite from internet services, High Performance Computer Architecture (HPCA), p.79, 2014.

X. Wang, M. Mubarak, X. Yang, R. B. Ross, and Z. Lan, Trade-Off Study of Localizing Communication and Balancing Network Traffic on a Dragonfly System, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), p.154, 2018.

M. Wasi-ur-rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, p.69, 2015.

B. White, J. Lepreau, and L. Stoller, An Integrated Experimental Environment for Distributed Systems and Networks, Proc. of the Fifth Symposium on Operating Systems Design and Implementation, USENIX Association, pp.255-270, 2002.

R. Wurmus, B. Uyar, and B. Osberg, Reproducible genomics analysis pipelines with GNU Guix, p.133, 2018.

L. Xu, S.-H. Lim, M. Li, A. R. Butt et al., Scaling up data-parallel analytics platforms: Linear algebraic operation cases, p.19, 2017.

P. Xuan, J. Denton, P. K. Srimani, R. Ge, and F. Luo, Big Data Analytics on Traditional HPC Infrastructure Using Two-level Storage, Proceedings of the 2015 International Workshop on Data-Intensive Scalable Computing Systems. DISCS '15, vol.4, pp.1-4, 2015.

O. Yildiz and S. Ibrahim, On the Performance of Spark on HPC Systems: Towards a Complete Picture, Supercomputing Frontiers, vol.117, p.29, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01742016

A. B. Yoo, M. A. Jette, and M. Grondona, SLURM: Simple Linux Utility for Resource Management, p.68, 2003.

M. Zaharia, M. Chowdhury, and T. Das, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, pp.2-2, 2012.

F. Z. Boito, E. C. Inacio, and J. L. Bez, A Checkpoint of Research on Parallel I/O for High Performance Computing, ACM Computing Surveys, vol.51, issue.2, p.29, 2018.

B. Zhang, Y. Ruan, and J. Qiu, Harp: Collective communication on hadoop, 2015 IEEE International Conference on, vol.19, p.10, 2015.

Webpages

A. , Derrick -documentation, p.133, 2019.

. Apache, Software projects built on Mesos, vol.18, 2015.

, Welcome to the Worldwide LHC Computing Grid, 2018.

A. Chu, chu11/magpie. Oct. 2013, vol.16, p.70, 2015.

, Disnix website, NixOS community, p.168, 2019.

, NixOS community. NixOps website, p.168, 2019.

Reproducible Builds community, Reproducible Build, URL: https://reproducible-builds.org/ (visited on Feb. 6, 2019), p.167.

L. Courtès, GuixHPC web site, vol.22, p.58, 2018.

L. Courtès and R. Wurmus, Reproducible software deployment for high-performance computing, p.58, 2018.

, Common Workflow Language, p.145, 2018.

Facebook, Corona, p.68, 2012.

D. Feitelson, Parallel Workloads Archive, vol.157, p.11, 2017.

. Feitelson, The Standard Workload Format version 2.2, vol.4, p.157, 2018.

Figshare, Figshare website, p.146, 2019.

InfiniBand Trade Association, InfiniBand Accelerates the World's Fastest Supercomputer, Two of the Top Five Supercomputers, and 77 Percent of New HPC Systems on the TOP500 List, p.22, 2017.

Spark current committers, p.17, 2018.

Apache Software Foundation, Apache Flink, p.63, 2019.

Apache Software Foundation, Apache Hadoop, vol.63, p.24, 2019.

Apache Software Foundation, Apache Spark, URL: https://spark.apache.org/, vol.101, p.63, 2019.

, Git Large File Storage, p.134, 2018.

, git-annex. git-annex, p.134, 2018.

, Grid5000 wiki -Experiment scripting tutorial, vol.23, 2018.

GNU Guile, Guile is a programming language, p.57, 2019.

. Hashicorp, Terraform website, p.168, 2019.

, SaltStack website, p.168, 2019.

J. Hines, ORNL researchers leverage GPU Tensor Cores to deliver unprecedented performance, vol.8, p.31, 2018.

, High Performance Conjugate Gradients, p.12, 2018.

, Intel. intel-hpdd/scheduling-connector-for-hadoop: HPC Adapter for Mapreduce/Yarn(HAM), 2017.

. Intel, Intel® Data Analytics Acceleration Library, p.20, 2018.

, Intel. IntelLabs HPAT, p.20, 2018.

P. Jupyter, Jupyter Project, 2019.

Icalia Labs, Whales - repository, p.133, 2019.

A. Legrand, SimGrid Usages, http://simgrid.gforge.inria.fr/Usages.html (visited on Aug. 12, 2018), p.148.

, NASA. NAS Parallel Benchmarks, vol.79, p.12, 2016.

. Nersc, NERSC-8 / Trinity Benchmarks, p.12, 2018.

. Livermore-oak-ridge-argonne, Coral-2 Benchmarks, p.12, 2018.

. Openstack, OpenStack diskimage-builder web site, vol.23, p.52, 2018.

. Openstack, OpenStack web site, vol.23, p.52, 2018.

RedHat, OpenShift, p.133, 2019.

, Org mode for Emacs, 2019.

T. Prickett-morgan, Teaching Grid Engine To Speak Mesos, vol.16, 2015.

, Exascale Proxy Applications, vol.122, p.12, 2018.

GNU Project, Emacs, p.145, 2019.

URL: https://repo2docker.readthedocs.io, p.133, 2019.

. Puppet, Puppet website, p.168, 2019.

T. Sterling, Beowulf Breakthroughs: The Path to Commodity Supercomputing, p.7, 2003.

Stencila, Dockter, p.133, 2019.

TACC, TACC Wrangler User Guide, p.23, 2018.

. Top500, Top500 statistics, vol.4, 2018.

, What is Harp-DAAL, vol.20, p.10, 2018.

. U. Univia, , p.166, 2019.

. Wikipedia, . Cern, and . Page, , p.15, 2018.

L. Wiki, Lustre Object Storage Service (OSS), p.28, 2019.

. Wikipedia, DevOps -wikipedia article, p.142, 2019.

. Wikipedia, Internet traffic -Global Internet traffic, 2019.

Zenodo, Zenodo website, p.146, 2019.

1.2 Computation capacity growth over the years [top19]

Scientific workflow that includes Big Data and HPC: 1. Lots of data is generated from observation of the real world, 2. It is reduced to generate a simulation model, 3. Simulation of this model produces large amounts of data, 4. The simulation data is reduced to extract scientific results

BeBiDa system overview: the cluster is capable of dynamically sharing resources between HPC parallel jobs and Big Data applications

4.2 Resource utilization using a static resource partitioning approach leads to a waste of resources

Resource sharing with adapter. This Figure was taken from

BeBiDa resource sharing mechanism is based on HPC prolog and epilog

Example of one experiment with BeBiDa enabled

Time effectiveness E of W_BigData regarding HPC workload utilization. Minimum is 44%, maximum is 91%, with a mean of 68%

Overhead on the HPC workload W_HPC in mean waiting time

Resource sharing example of 3 applications using the parallel task model running in concurrency. Green (dotted) application takes half of the resources of node-1 and consequently reduces the computation power allocated to the Purple. (3) Yellow takes half of the link bandwidth

Summary of the differences between HPC and Big Data platforms

Summary of software stack provisioning process features and their implementation for the main provisioning tools. Each feature is described more precisely in Section 3

Comparison between HPC and Big Data RJMS collaboration approaches

5.1 HDD and SSD sequential read and write, extracted from recent benchmark results. HDD-low is the disk capacity with 30% diminished capacity due to file system overhead