Assessing the Quality of RDF Mappings with EvaMap

Abstract. Linked Data (LD) is a set of best practices to publish reusable data on the web in RDF format. Despite the benefits of LD, many datasets are not published as RDF. Transforming structured datasets into RDF datasets is possible thanks to RDF mappings. But, for the same dataset, different mappings can be proposed. We believe that a tool capable of evaluating the quality of an RDF mapping would make the creation of mappings easier. In this paper, we present EvaMap, a framework to assess the quality of RDF mappings. The demonstration shows how EvaMap can be used to evaluate and improve RDF mappings.


1 Introduction and Motivation
Linked Data (LD) is a set of best practices to publish reusable data on the web in RDF format. Despite the benefits of LD, many datasets are not published as RDF. Transforming structured datasets into RDF datasets is possible thanks to RDF mappings.
An RDF mapping consists of a set of rules that map data from an input dataset to RDF triples. Languages like R2RML and RML are widely used to define machine-readable mappings. In this work, we use YARRRML, a human-readable representation of RDF mappings.
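Conceptually, each mapping rule turns one input row into a set of RDF triples. The following Python sketch illustrates this idea; the IRI template, the ontology IRIs, and the exact column names are illustrative assumptions, not taken from an actual YARRRML file.

```python
# Minimal sketch of one mapping rule: a function from an input row
# (a dict of column name -> value) to a list of (s, p, o) triples.
# The IRI template and property IRIs below are hypothetical examples.
def apply_rule(row):
    subject = "http://example.org/emperor/" + row["Name"].replace(" ", "_")
    return [
        (subject, "rdf:type", "http://dbpedia.org/ontology/Person"),
        (subject, "http://dbpedia.org/ontology/birthPlace", row["Birth Province"]),
    ]

# Applying the rule to every row of a (tiny) tabular dataset yields the RDF dataset.
rows = [{"Name": "Augustus", "Birth Province": "Italia"}]
triples = [t for row in rows for t in apply_rule(row)]
```

Languages like R2RML, RML, and YARRRML express exactly this kind of rule declaratively, instead of in code.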
Making a relevant RDF mapping for a dataset is a challenging task because it requires answering several questions: 1. What are the different resources described in the dataset (e.g., cars, persons, cities, places, etc.)? 2. What are the attributes of these resources (e.g., price, age, etc.)? 3. How should the IRIs of resources be defined? 4. What are the possible relations between the different resources (e.g., the city is the birthplace of the person)? 5. Which ontology, classes, and properties should be used?
In addition to possible errors by the user, different answers are possible for some of these questions and, thus, different RDF mappings are possible for the same dataset.
For example, Figure 1 represents two possible mappings for the dataset in Table 1. Unlike mapping 1(a), mapping 1(b) does not include a class description in resource IRIs and does not reference the Birth Province column. Given a structured dataset, how can we help users create error-free RDF mappings automatically, and how can we choose the best mapping from a set of RDF mappings?
We believe that a tool capable of evaluating the quality of an RDF mapping would make the creation and the choice of RDF mappings easier. [1] proposes a framework that assesses and refines RML mappings. However, its authors focus on logical errors due to incorrect usage of ontologies (e.g., violation of domain, range, disjoint classes, etc.). [3] proposes a framework to assess the quality of RDF datasets through metrics. Metrics are organized in dimensions evaluating different aspects of a dataset (e.g., availability, interlinking, etc.). But [3] does not propose to assess the quality of an RDF mapping. In our work, as in [1], we evaluate metrics on the RDF mapping instead of on the resulting RDF dataset. This choice allows us to identify errors at the beginning of the publishing process and saves time.
Based on the framework proposed in [3], we propose EvaMap, a framework to Evaluate RDF Mappings. The goal is to control the quality of the resulting dataset through its mapping, without having to generate the RDF dataset.
2 EvaMap: A Framework to Evaluate RDF Mappings

EvaMap uses a set of metrics organized in 7 dimensions. Each metric is evaluated on the RDF mapping, or on the resulting RDF dataset when instances are needed. For example, the available resource IRIs metric needs the RDF dataset to check whether generated IRIs are dereferenceable. In this case, EvaMap generates a sample of the input dataset, so that applying each mapping rule to the entire input dataset is not necessary. Table 2 describes each dimension of EvaMap. These dimensions are based on [3]. In addition, we propose the Coverability dimension, which detects the loss of data between the input dataset and the resulting RDF dataset. We also introduce four new metrics, described in Table 3. To compute the quality of a mapping M_i applied on a raw dataset D, we propose a function q(M_i, D) ∈ [0, 1] that is the weighted mean of the quality of each metric m_j(M_i, D):

q(M_i, D) = (Σ_j w_j · m_j(M_i, D)) / (Σ_j w_j)

We use the same function to compute the score for a specific dimension; to do that, we only consider the subset of metrics for the corresponding dimension.
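The sampling step for instance-level metrics could be sketched as follows; the function name, sample size, and seed are assumptions for illustration, not EvaMap's actual implementation.

```python
import random

def sample_rows(rows, k=20, seed=0):
    """Draw a small deterministic sample of input rows so that
    instance-level metrics (e.g. checking that generated IRIs are
    dereferenceable) need not apply every mapping rule to the
    whole input dataset."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(rows, min(k, len(rows)))
```

Checking a handful of generated IRIs from such a sample is usually enough to reveal a systematically broken IRI template.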
Weights w_j associated with metrics can be used to give more or less importance to each metric. For example, the user does not always want to generate RDF triples for all data in the input dataset. Thus, weights associated with coverability metrics can be lowered or set to zero.
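The weighted mean above and the effect of zeroing a weight can be sketched in a few lines of Python (a minimal illustration, not EvaMap's actual code):

```python
def quality(metric_scores, weights):
    """Weighted mean of per-metric quality scores m_j(M_i, D), each in [0, 1].

    The same function serves for a dimension score: pass only the
    metric scores (and weights) belonging to that dimension."""
    total = sum(weights)
    if total == 0:
        return 0.0  # no active metrics: nothing to score
    return sum(w * m for w, m in zip(weights, metric_scores)) / total

scores = [1.0, 0.5, 0.8]
equal = quality(scores, [1.0, 1.0, 1.0])     # all metrics count equally
no_cov = quality(scores, [1.0, 1.0, 0.0])    # e.g. a coverability metric zeroed out
```

Setting a weight to zero, as in the second call, removes that metric from the score entirely, which matches the case where the user deliberately skips some input columns.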

Demonstration
We implemented EvaMap to evaluate YARRRML [2] mappings for datasets of the OpenDataSoft data network. Our tool is available as a web service at https://evamap.herokuapp.com/. The source code of our tool and of the web service is available on GitHub under the MIT license.
During the demonstration, attendees will be able to select different mappings and use EvaMap to compare them. For each mapping, the global quality score will be computed, as well as the quality score for each dimension. Our tool will also give feedback on how to improve RDF mappings.
In our tool, users can assess two mappings for the dataset football-ligue. Users can see that the mapping football-ligue obtains a worse global score than the mapping football-ligue-fixed. In the detailed report, users can analyze, dimension by dimension, why these scores differ.

Fig. 1. Two RDF mappings for the Roman emperors dataset. Bold text starting with $ denotes a reference to a column in the dataset.

Table 1. Excerpt from a structured dataset describing Roman emperors.

Table 2. Dimensions used by EvaMap.
Connectability: Checks if links exist between local and external resources.
Coverability: Checks if the RDF mapping is exhaustive compared to the initial dataset.

Table 3. New metrics proposed in EvaMap.