Improving Hamming distance-based fuzzy join in MapReduce using Bloom Filters
Conference paper, Year: 2018

Abstract

The join is one of the key operations in databases, combining data from several tables. Two tuples are joined when they share the same value on some attribute(s). A fuzzy, or similarity, join combines all pairs of tuples, from one or several relations, whose distance is lower than or equal to a prespecified threshold ε. Fuzzy joins have been studied by many researchers because of their practical applications. However, the join is the most costly operation and may even be infeasible to compute on large databases. In this paper, we therefore propose an optimization of MapReduce algorithms that process fuzzy joins of binary strings using the Hamming distance. In particular, we propose using an extension of Bloom filters to eliminate redundant data, reduce unnecessary comparisons, and avoid duplicate output. We compare and evaluate the algorithms analytically with a cost model.
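To make the definitions in the abstract concrete, here is a minimal single-process sketch in Python. It is not the algorithm from the paper: the record does not describe the proposed Bloom filter extension or the MapReduce jobs, so the sketch uses a plain Bloom filter and a simple "enumerate the Hamming ball, then probe" strategy as a stand-in. All names (`hamming`, `naive_fuzzy_join`, `BloomFilter`, `ball`, `filtered_fuzzy_join`) and the filter parameters are assumptions chosen for illustration.

```python
import hashlib
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length binary strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def naive_fuzzy_join(R, S, eps):
    """Reference join: compare every pair, keep those with distance <= eps."""
    return {(r, s) for r in R for s in S if hamming(r, s) <= eps}


class BloomFilter:
    """Plain Bloom filter (assumption: NOT the extension the paper proposes)."""

    def __init__(self, m_bits: int = 1024, k_hashes: int = 3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str) -> bool:
        # May return false positives, never false negatives.
        return all(self.bits[p] for p in self._positions(item))


def ball(s: str, eps: int):
    """Yield every binary string within Hamming distance eps of s, once each."""
    for d in range(eps + 1):
        for idxs in combinations(range(len(s)), d):
            chars = list(s)
            for i in idxs:
                chars[i] = "1" if chars[i] == "0" else "0"
            yield "".join(chars)


def filtered_fuzzy_join(R, S, eps):
    """Probe a Bloom filter built on S before doing any exact check."""
    bf = BloomFilter()
    for s in S:
        bf.add(s)
    s_set = set(S)  # stands in for the exact, reduce-side verification
    out = set()     # a set, so each joined pair is emitted only once
    for r in R:
        for cand in ball(r, eps):
            # The cheap filter test discards most candidates absent from S;
            # the exact test removes the filter's false positives.
            if cand in bf and cand in s_set:
                out.add((r, cand))
    return out


R = ["0000", "1111"]
S = ["0001", "1110", "0110"]
print(sorted(filtered_fuzzy_join(R, S, eps=1)))
```

In a distributed setting the point of the filter is that it is compact enough to ship to every mapper, so candidates that cannot match are dropped before they are shuffled; in this single-process sketch the exact set lookup makes that saving invisible, and only the structure of the idea is shown.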
Main file
18fuzzieee.pdf (403.13 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01857386, version 1 (15-08-2018)

Cite

Thi-To-Quyen Tran, Thuong-Cang Phan, Anne Laurent, Laurent D’orazio. Improving Hamming distance-based fuzzy join in MapReduce using Bloom Filters. FUZZ-IEEE 2018 - International Conference on Fuzzy Systems, Jul 2018, Rio de Janeiro, Brazil. pp.1-7, ⟨10.1109/FUZZ-IEEE.2018.8491658⟩. ⟨hal-01857386⟩