Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics

Bogdan Nicolae; Carlos Costa; Claudia Misale; Kostas Katrinis; Yoonho Park

doi:10.1109/TPDS.2016.2627558

Journal article, IEEE Transactions on Parallel and Distributed Systems, Year : 2017

Bogdan Nicolae

Function : Corresponding author
PersonId : 21945
IdHAL : bnicolae
ORCID : 0000-0002-0661-7509

IBM Research - Ireland

Carlos Costa

Function : Author

IBM T.J. Watson Research Center

Claudia Misale

Function : Author

Università degli studi di Torino = University of Turin

Kostas Katrinis

Function : Author

IBM Research - Ireland

Yoonho Park

Function : Author

IBM T.J. Watson Research Center

Abstract

Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and the need for shorter time-to-solution, in-memory platforms such as Apache Spark are gaining popularity. In this context, data shuffling, a particularly difficult transformation pattern, introduces important challenges. Specifically, data shuffling is a key component of complex computations and has a major impact on overall performance and scalability. Thus, speeding up data shuffling is a critical goal. To this end, state-of-the-art solutions often rely on overlapping the data transfers with the shuffling phase. However, they employ simple mechanisms to decide how much data to fetch and where to fetch it from, which leads to sub-optimal performance and excessive auxiliary memory utilization for prefetching. The latter aspect is a growing concern, given evidence that memory per computation unit is continuously decreasing while interconnect bandwidth is increasing. This paper contributes a novel shuffle data transfer strategy that addresses both dimensions, performance and memory utilization, by dynamically adapting the prefetching to the computation. We implemented this strategy in Spark, a popular in-memory data analytics framework. To demonstrate the benefits of our proposal, we ran extensive experiments on an HPC cluster with a large core count per node. Compared with the default Spark shuffle strategy, our proposal delivers up to 40% better performance with 50% less memory used for buffering, as well as excellent weak scalability.

Keywords

distributed systems; big data analytics; Spark; data shuffling; scalable I/O; memory-efficient data transformations; I/O load balancing; elastic buffering

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

tpds.pdf (2.66 MB)

Origin : Files produced by the author(s)

Contributor : Bogdan Nicolae

https://inria.hal.science/hal-01531374

Submitted on : Thursday, June 1, 2017, 4:01:24 PM

Last modification on : Tuesday, January 9, 2024, 12:34:04 PM

Long-term archiving on : Wednesday, September 6, 2017, 7:11:43 PM

Dates and versions

hal-01531374 , version 1 (01-06-2017)

Identifiers

HAL Id : hal-01531374 , version 1
DOI : 10.1109/TPDS.2016.2627558

Cite

Bogdan Nicolae, Carlos Costa, Claudia Misale, Kostas Katrinis, Yoonho Park. Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics. IEEE Transactions on Parallel and Distributed Systems, 2017, 28 (6), pp. 1663-1674. ⟨10.1109/TPDS.2016.2627558⟩. ⟨hal-01531374⟩
