Pipelined parallelism for multi-join queries on shared nothing machines
Abstract
The development of scalable parallel database systems requires the design of efficient algorithms, especially for the join, which is the most frequent and expensive operation in relational database systems. The join is also the operation most vulnerable to data skew and to the high cost of communication in distributed architectures. Moreover, for multi-join queries, the problem of data skew is more complicated because the imbalance of intermediate results is unknown during static query optimization. In this paper, we show that the join algorithms presented in our earlier papers can be applied efficiently in various parallel execution strategies, making it possible to exploit not only intra-operator parallelism but also inter-operator parallelism. These algorithms reduce communication and synchronization costs to a minimum while guaranteeing perfect load balancing during all stages of join computation, even for highly skewed data.