SPARQL query processing with Apache Spark

Abstract: The number and size of linked open data graphs keep growing at a fast pace, confronting semantic RDF services with problems characterized as Big Data. Distributed query processing is one of them and needs to be efficiently addressed with execution guaranteeing scalability, high availability and fault tolerance. RDF data management systems requiring these properties are rarely built from scratch but are rather designed on top of an existing engine. In this work, we consider the processing of SPARQL queries with the current state-of-the-art cluster computing engine, namely Apache Spark. We propose and compare five different query processing approaches based on different join execution models and Spark components. A detailed experimentation on real-world and synthetic data sets promotes two new approaches tailored for the RDF data model which outperform (by a factor of up to 2.4 on query execution time compared to a state-of-the-art distributed SPARQL processing engine) the other ones on all major query shapes, i.e., star, snowflake, chain and their composition.
Document type:
Preprint, working paper
2016

https://hal.archives-ouvertes.fr/hal-01447387
Contributor: Bernd Amann
Submitted on: Thursday, January 26, 2017 - 18:06:56
Last modified on: Thursday, July 5, 2018 - 14:45:54

Identifiers

  • HAL Id: hal-01447387, version 1
  • arXiv: 1604.08903

Citation

Hubert Naacke, Olivier Curé, Bernd Amann. SPARQL query processing with Apache Spark. 2016. 〈hal-01447387〉
