Semi-join Computation on Distributed File Systems Using Map-Reduce-Merge Model

Abstract : Semi-join is the most widely used technique for optimizing the processing of complex relational queries on distributed architectures. However, the overhead of semi-join computation can be very high because of data skew and the high cost of communication in such architectures. Internet search engines need to process vast amounts of raw data every day. Hence, systems that manage such data must ensure scalability, reliability and availability while keeping query processing time reasonable. Hadoop and Google's File System are examples of such systems. In this paper, we present a new algorithm based on the Map-Reduce-Merge model and distributed histograms for processing semi-join operations on such systems. A cost analysis of this algorithm shows that our approach is insensitive to data skew while reducing communication and disk Input/Output costs to a minimum.
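
For readers unfamiliar with the operation, the following is a minimal single-machine sketch of what a semi-join computes, using per-relation key histograms to decide which tuples can contribute. It is an illustration only and not the paper's Map-Reduce-Merge algorithm: the function name `semi_join`, the sample relations, and the use of in-memory `Counter` objects in place of distributed histograms are all assumptions made for this example.

```python
from collections import Counter

def semi_join(r, s, key_r, key_s):
    """Return the tuples of r whose join key also appears in s.

    Hypothetical helper: a local stand-in for a distributed semi-join.
    """
    # Histograms map each join-key value to its frequency in the relation.
    hist_r = Counter(key_r(t) for t in r)
    hist_s = Counter(key_s(t) for t in s)
    # Only keys present in both histograms can produce result tuples.
    common = hist_r.keys() & hist_s.keys()
    # Keep the tuples of r whose key survives the intersection.
    return [t for t in r if key_r(t) in common]

# Made-up sample data: (join key, payload) pairs.
orders = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
customers = [(2, "x"), (3, "y"), (5, "z"), (5, "w")]

result = semi_join(orders, customers, lambda t: t[0], lambda t: t[0])
print(result)
```

In a distributed setting, exchanging the small key histograms rather than full tuples is one way to limit communication; how the paper actually does this is described in the full text, not in this sketch.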
Document type :
Conference papers
Contributor : Mostafa Bamha
Submitted on : Monday, March 1, 2010 - 9:19:24 PM
Last modification on : Thursday, January 17, 2019 - 3:06:04 PM


  • HAL Id : hal-00460665, version 1



Mohamad Al Hajj Hassan, Mostafa Bamha. Semi-join Computation on Distributed File Systems Using Map-Reduce-Merge Model. (SAC'2010), Mar 2010, Sierre, Switzerland. ACM Press, pp.406-413, 2010. 〈hal-00460665〉


