Linking and disambiguating entities across heterogeneous RDF graphs

Manel Achichi (1), Zohra Bellahsene (1), Mohamed Ben Ellefi (2), Konstantin Todorov (1)
(1) FADO - Fuzziness, Alignments, Data & Ontologies
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier
Abstract : Establishing identity links across RDF datasets is a central and challenging task on the way to realising the Data Web project. It is well known that data supplied by different sources can be highly heterogeneous: two entities referring to the same real-world object are often described, structured and valued differently, or in a complementary fashion. In this paper, we explore the origins and the multiplicity of data heterogeneity problems, and propose a novel classification that makes it possible to isolate challenges and to position our work and future work. Many state-of-the-art data linking approaches rely on sets of discriminative properties, provided by the user or by specialised tools, which, without knowledge of the nature of the data, cannot automatically account for a large number of structural heterogeneities. In addition, similarity measures and thresholds need to be selected and tuned manually or learned by specialised algorithms. We propose a solution that covers a significant number of heterogeneities while attempting to reduce the user configuration effort, based on: (i) property filtering, or automatic data cleaning of “problematic” attributes; (ii) instance profiling, which represents each resource by a sub-graph considered relevant for the comparison task; and (iii) instance vector representation, which allows resources to be compared. To reduce the false positive rate, we apply (iv) a post-processing step based on hierarchical clustering and key ranking techniques, aiming to disambiguate highly similar, though not identical, instances. This pipeline is implemented in Legato, a data linking tool shown to outperform or perform as well as state-of-the-art tools on highly heterogeneous and diverse benchmark datasets, while keeping the user configuration effort low.
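
As a rough illustration of steps (i) to (iii) only, the following Python sketch filters out some properties, turns the remaining literals of each resource into a token-count vector, and links each source resource to its most similar target. It is not Legato's implementation. The toy data, the IGNORED property list, the raw token counts and the cosine measure are all assumptions made for this example; the abstract does not name a weighting scheme or a similarity measure. The post-processing step (iv) is not covered.

```python
import math
from collections import Counter

# Hypothetical toy data: each resource is a list of (property, literal) pairs.
SOURCE = {
    "src:1": [("name", "Wolfgang Amadeus Mozart"), ("born", "Salzburg 1756"), ("id", "a91")],
    "src:2": [("name", "Ludwig van Beethoven"), ("born", "Bonn 1770"), ("id", "b17")],
}
TARGET = {
    "tgt:A": [("label", "Mozart, Wolfgang Amadeus"), ("birthPlace", "Salzburg"), ("code", "x3")],
    "tgt:B": [("label", "Beethoven, Ludwig van"), ("birthPlace", "Bonn"), ("code", "y8")],
}

# Assumed stand-in for property filtering: a fixed list of "problematic"
# properties. A real system would detect these automatically.
IGNORED = {"id", "code"}


def profile(pairs):
    """Build a bag of lowercased tokens from the literals of kept properties.

    Property names are deliberately dropped, so differently named
    properties in the two datasets can still be compared.
    """
    tokens = []
    for prop, literal in pairs:
        if prop in IGNORED:
            continue
        tokens.extend(literal.lower().replace(",", " ").split())
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def link(source, target):
    """Map each source resource to its most similar target resource."""
    target_profiles = {uri: profile(pairs) for uri, pairs in target.items()}
    links = {}
    for uri, pairs in source.items():
        vec = profile(pairs)
        links[uri] = max(target_profiles, key=lambda t: cosine(vec, target_profiles[t]))
    return links


LINKS = link(SOURCE, TARGET)
```

Dropping property names in this toy version hints at why a vector representation can tolerate structural differences between datasets. A best-match strategy like this one will also produce false positives on near-duplicate resources, which is the problem the post-processing step in the abstract is meant to address.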
Document type : Journal articles

Cited literature : 25 references
Contributor : Konstantin Todorov
Submitted on : Monday, January 21, 2019 - 2:57:50 PM
Last modification on : Friday, April 5, 2019 - 1:20:12 AM





Manel Achichi, Zohra Bellahsene, Mohamed Ben Ellefi, Konstantin Todorov. Linking and disambiguating entities across heterogeneous RDF graphs. Journal of Web Semantics, Elsevier, 2019, 55, pp.108-121. ⟨10.1016/j.websem.2018.12.003⟩. ⟨hal-01987332⟩


