Towards Scalability and Data Skew Handling in GroupBy-Joins using MapReduce Model

Mohamad Al Hajj Hassan; Mostafa Bamha

Conference paper, Year: 2015


Mohamad Al Hajj Hassan

Role: Author

Lebanese International University

Mostafa Bamha

Role: Author
PersonId: 952859

Laboratoire d'Informatique Fondamentale d'Orléans

Abstract

For over a decade, MapReduce has been the leading programming model for massively parallel processing of large volumes of data. Its adoption has been driven by the development of many frameworks, such as Spark, Pig and Hive, that facilitate data analysis on large-scale systems. However, these frameworks remain vulnerable to communication costs, data skew and task imbalance. These problems can have a devastating effect on the performance and scalability of such systems, particularly when processing GroupBy-Join queries on large datasets. In this paper, we present a new GroupBy-Join algorithm that considerably reduces communication costs while avoiding the effects of data skew. A cost analysis of this algorithm shows that our approach is insensitive to data skew and ensures perfect balancing properties during all stages of GroupBy-Join computation, even for highly skewed data. These results have been confirmed by a series of experiments.
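To make the query class concrete, the following minimal sketch contrasts two ways of evaluating a GroupBy-Join on a single machine. It is an illustration only and is not the algorithm presented in the paper: it assumes a SUM aggregate over the product of two value columns, and the function names and sample relations are hypothetical. The first version materializes every joined pair before aggregating, so a frequent join key produces a quadratic number of intermediate tuples. The second aggregates each relation per key before joining, which is one common way to shrink the data that would have to be communicated.

```python
from collections import defaultdict

def groupby_join_naive(r, s):
    """Join r and s on the key, then sum r_val * s_val per key.

    Every matching pair is enumerated, so a key occurring m times in r
    and n times in s contributes m * n intermediate tuples.
    """
    s_by_key = defaultdict(list)
    for key, s_val in s:
        s_by_key[key].append(s_val)
    totals = defaultdict(int)
    for key, r_val in r:
        for s_val in s_by_key.get(key, []):
            totals[key] += r_val * s_val
    return dict(totals)

def groupby_join_preaggregated(r, s):
    """Aggregate each relation per key first, then join the partial sums.

    This relies on the identity
    sum over pairs of (r_val * s_val) == sum(r_val) * sum(s_val),
    which holds for this assumed SUM-of-products aggregate only.
    """
    r_sum = defaultdict(int)
    for key, r_val in r:
        r_sum[key] += r_val
    s_sum = defaultdict(int)
    for key, s_val in s:
        s_sum[key] += s_val
    # Only keys present in both relations appear in the join result.
    return {key: r_sum[key] * s_sum[key] for key in r_sum.keys() & s_sum.keys()}

# Hypothetical skewed input: key "a" dominates both relations.
r = [("a", 1)] * 1000 + [("b", 2)]
s = [("a", 3)] * 1000 + [("b", 4), ("c", 5)]

naive_result = groupby_join_naive(r, s)
preaggregated_result = groupby_join_preaggregated(r, s)
```

On this input both functions return the same result, but the naive version enumerates one million pairs for key "a", while the pre-aggregated version joins one partial sum per key and relation.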

Keywords

Join and GroupBy-join operations; Data skew; MapReduce programming model; Distributed file systems; Hadoop framework; Apache Pig Latin

Domains

Computer Science

https://hal.science/hal-01160931

Submitted on: Monday, June 8, 2015, 12:31:58

Last modified on: Saturday, June 25, 2022, 10:12:44

Dates and versions

hal-01160931, version 1 (08-06-2015)

Identifiers

HAL Id: hal-01160931, version 1

Cite

Mohamad Al Hajj Hassan, Mostafa Bamha. Towards Scalability and Data Skew Handling in GroupBy-Joins using MapReduce Model. International Conference On Computational Science - ICCS 2015, Jun 2015, Reykjavik, Iceland. pp.70-79. ⟨hal-01160931⟩

Collections

UNIV-ORLEANS MSL MSL-THESE
