Conference paper. Year: 2015

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

Olivier Curé
Hubert Naacke
Mohamed-Amine Baazizi
Bernd Amann

Abstract

Querying very large RDF data sets efficiently and scalably requires parallel query plans combined with appropriate data distribution strategies. Several innovative solutions have recently been proposed for optimizing data distribution with or without predefined query workloads. This paper presents an in-depth analysis and experimental comparison of five representative RDF data distribution approaches. To obtain fair experimental results, we use Apache Spark as a common parallel computing framework and rewrite the algorithms concerned using the Spark API. Spark provides guarantees of fault tolerance, high availability, and scalability, which are essential in such systems. Our implementations aim to highlight the fundamental, implementation-independent characteristics of each approach in terms of data preparation, load balancing, data replication and, to some extent, query answering cost and performance. The reported measurements are obtained by testing each system on one synthetic and one real-world data set, over query workloads with differing characteristics and under different partitioning constraints.

Dates and versions

hal-01214902, version 1 (13-10-2015)

Cite

Olivier Curé, Hubert Naacke, Mohamed-Amine Baazizi, Bernd Amann. On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark. The 11th International Workshop on Scalable Semantic Web Knowledge Base Systems, Oct 2015, Bethlehem, Pennsylvania, United States. pp.16-31. ⟨hal-01214902⟩