Scalability and Optimisation of GroupBy-Joins in MapReduce
Report (Research Report), Year: 2015

Abstract

For over a decade, MapReduce has been the leading programming model for parallel processing of very large volumes of data. Its adoption has been driven by the development of many frameworks, such as Spark, Pig and Hive, that facilitate data analysis on large-scale systems. However, these frameworks remain vulnerable to communication costs, data skew and task imbalance. These problems can have a devastating effect on the performance and scalability of such systems, particularly when processing GroupBy-Join queries on large datasets. In this paper, we present a new GroupBy-Join algorithm that considerably reduces communication costs while avoiding the effects of data skew. A cost analysis of this algorithm shows that our approach is insensitive to data skew and ensures perfect balancing properties during all stages of GroupBy-Join computation, even for highly skewed data. These results have been confirmed by a series of experiments.
Main file
rr2015-1.pdf (860.89 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01138840, version 1 (02-04-2015)

Identifiers

  • HAL Id: hal-01138840, version 1

Cite

Mostafa Bamha, Mohamad Al Hajj Hassan. Scalability and Optimisation of GroupBy-Joins in MapReduce. [Research Report] LIFO, Université d'Orléans. 2015. ⟨hal-01138840⟩
