Scalable Iterative Graph Duplicate Detection

Melanie Herschel; Felix Naumann; Sascha Szott; Maik Taubert

Journal Articles IEEE Transactions on Knowledge and Data Engineering Year : 2012

Melanie Herschel

Laboratoire de Recherche en Informatique

Database optimizations and architectures for complex large data

Institut Wilhelm Schickard

Felix Naumann

Hasso Plattner Institute for Software Systems Engineering

Sascha Szott

Konrad-Zuse-Zentrum für Informationstechnik Berlin

Maik Taubert

Biotronik SE & Co. KG, Berlin

Abstract

Duplicate detection identifies different representations of the same real-world object in a database. Recent research has considered using relationships among object representations to improve duplicate detection. In the general case, where relationships form a graph, research has mainly focused on the quality (effectiveness) of duplicate detection. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. We scale up duplicate detection in graph data (ddg) to large amounts of data and pairwise comparisons, using the support of a relational database management system. To this end, we first present a framework that generalizes the ddg process. We then present algorithms to scale ddg in space (amount of data processed with bounded main memory) and in time. Finally, we extend our framework to allow batched and parallel ddg, thus further improving efficiency. Experiments on data up to two orders of magnitude larger than the data considered so far in ddg show that our methods achieve the goal of scaling ddg to large volumes of data.
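To make the idea of iterative duplicate detection over related objects concrete, here is a minimal in-memory sketch in Python. It is not the framework or the RDBMS-backed algorithms from the paper. The function names, the token-based Jaccard similarity, the equal weighting of attribute and relationship similarity, the threshold, and the movie/actor data are all assumptions chosen for illustration. The sketch only shows the general pattern: a pair's similarity depends partly on whether its related objects are already known duplicates, so classifying one pair as a duplicate can trigger re-comparison of the pairs that depend on it.

```python
from itertools import combinations


def tokens(text):
    # Lowercased whitespace tokens: a deliberately crude attribute representation.
    return set(text.lower().split())


def jaccard(x, y):
    # Jaccard overlap of two sets; defined here as 0.0 when both are empty.
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def detect_duplicates(labels, neighbors, threshold=0.7, weight=0.5):
    """labels: id -> description; neighbors: id -> ids it depends on (hypothetical shapes)."""
    parent = {c: c for c in labels}
    pairs = list(combinations(sorted(labels), 2))
    pending = list(pairs)

    def related_clusters(c):
        # Duplicate clusters (union-find roots) of the objects related to c.
        return {find(parent, n) for n in neighbors.get(c, ())}

    while pending:
        a, b = pending.pop()
        if find(parent, a) == find(parent, b):
            continue  # already in the same duplicate cluster
        attr_sim = jaccard(tokens(labels[a]), tokens(labels[b]))
        na, nb = related_clusters(a), related_clusters(b)
        if na or nb:
            # Assumed combination: weighted mean of attribute and relationship similarity.
            sim = (1 - weight) * attr_sim + weight * jaccard(na, nb)
        else:
            sim = attr_sim
        if sim < threshold:
            continue
        parent[find(parent, a)] = find(parent, b)
        root = find(parent, b)
        # Iterative step: pairs whose related objects both touch the merged
        # cluster may now be more similar, so they are compared again.
        for p, q in pairs:
            if root in related_clusters(p) and root in related_clusters(q):
                pending.append((p, q))

    clusters = {}
    for c in labels:
        clusters.setdefault(find(parent, c), []).append(c)
    return sorted(sorted(members) for members in clusters.values() if len(members) > 1)


# Hypothetical example data: two movie records that depend on their actor records.
labels = {
    "m1": "The Matrix",
    "m2": "Matrix",
    "a1": "Keanu Reeves",
    "a2": "Reeves Keanu",
    "a3": "Carrie-Anne Moss",
}
neighbors = {"m1": {"a1", "a3"}, "m2": {"a2", "a3"}}

print(detect_duplicates(labels, neighbors))
```

In this toy data the two movie records are not similar enough on their titles alone. They are merged only after the two actor records have been classified as duplicates, which is the kind of dependency that makes the process iterative. The sketch compares all pairs in main memory, which is precisely the cost that the paper's space- and time-scalable algorithms are meant to avoid.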

Domains

Databases [cs.DB]

Main file

TKDE2012a_herschel.pdf (1.35 MB)

Origin : Publisher files allowed on an open archive

https://inria.hal.science/hal-00757604

Submitted on : Tuesday, November 27, 2012-11:53:44 AM

Last modification on : Friday, May 24, 2024-9:24:06 AM

Long-term archiving on: Thursday, February 28, 2013-3:43:30 AM

Dates and versions

hal-00757604 , version 1 (27-11-2012)

Identifiers

HAL Id : hal-00757604 , version 1

Cite

Melanie Herschel, Felix Naumann, Sascha Szott, Maik Taubert. Scalable Iterative Graph Duplicate Detection. IEEE Transactions on Knowledge and Data Engineering, 2012, 24 (11), pp.2094-2108. ⟨hal-00757604⟩

Collections

EC-PARIS CNRS INRIA UMR8623 INRIA2 LRI-LAHDAK UNIV-PARIS-SACLAY
