Journal articles

Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics

Abstract : Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and the need for shorter time-to-solution, in-memory platforms such as Apache Spark are gaining popularity. In this context, data shuffling, a particularly difficult transformation pattern, introduces important challenges. Specifically, data shuffling is a key component of complex computations and has a major impact on overall performance and scalability. Thus, speeding up data shuffling is a critical goal. To this end, state-of-the-art solutions often rely on overlapping the data transfers with the shuffling phase. However, they employ simple mechanisms to decide how much data to fetch and where to fetch it from, which leads to sub-optimal performance and excessive auxiliary memory utilization for prefetching. The latter aspect is a growing concern, given evidence that memory per computation unit is continuously decreasing while interconnect bandwidth is increasing. This paper contributes a novel shuffle data transfer strategy that addresses both of these dimensions by dynamically adapting the prefetching to the computation. We implemented this strategy in Spark, a popular in-memory data analytics framework. To demonstrate the benefits of our proposal, we ran extensive experiments on an HPC cluster with a large core count per node. Compared with the default Spark shuffle strategy, our proposal shows up to 40% better performance with 50% less memory utilization for buffering, as well as excellent weak scalability.

Cited literature [29 references]

Contributor : Bogdan Nicolae
Submitted on : Thursday, June 1, 2017 - 4:01:24 PM
Last modification on : Monday, February 7, 2022 - 4:06:03 PM
Long-term archiving on : Wednesday, September 6, 2017 - 7:11:43 PM





Bogdan Nicolae, Carlos Costa, Claudia Misale, Kostas Katrinis, Yoonho Park. Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics. IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers, 2017, 28 (6), pp. 1663-1674. ⟨10.1109/TPDS.2016.2627558⟩. ⟨hal-01531374⟩


