Improving Hamming distance-based fuzzy join in MapReduce using Bloom Filters

Abstract: The join is one of the key operations in databases, combining data from several tables. Two tuples are joined when they share the same value on some attribute(s). A fuzzy or similarity join combines all pairs of tuples, from one or several relations, whose distance is lower than or equal to a prespecified threshold ε. Fuzzy join has been studied by many researchers because of its practical applications. However, the join is the most costly operation and may even be impossible to compute on large databases. In this paper, we therefore propose an optimization for MapReduce algorithms that process fuzzy joins of binary strings using the Hamming distance. In particular, we propose to use an extension of Bloom filters to eliminate redundant data, reduce unnecessary comparisons, and avoid duplicate output. We compare and evaluate the algorithms analytically with a cost model.
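
To make the join condition in the abstract concrete, the following is a minimal single-machine sketch of a Hamming-distance fuzzy self-join over binary strings. It is not the paper's MapReduce algorithm and uses no Bloom filter; it is the naive all-pairs baseline that such optimizations aim to improve on. The function names `hamming_distance` and `naive_fuzzy_self_join` and the sample strings are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length binary strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def naive_fuzzy_self_join(strings, eps):
    """Return every unordered pair whose Hamming distance is <= eps.

    Hypothetical baseline: compares all pairs, so the number of
    comparisons grows quadratically with the input size.
    """
    return [
        (s, t)
        for s, t in combinations(strings, 2)
        if hamming_distance(s, t) <= eps
    ]


# Illustrative data (assumed, not from the paper).
pairs = naive_fuzzy_self_join(["0000", "0001", "0111", "1111"], eps=1)
print(pairs)
```

With ε = 1, only the pairs that differ in at most one bit are kept. The quadratic number of comparisons in this baseline is the cost that motivates filtering out redundant data and unnecessary comparisons before the join.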
Document type:
Conference paper
FUZZ-IEEE: International Conference on Fuzzy Systems, Jul 2018, Rio de Janeiro, Brazil. IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2018

https://hal.archives-ouvertes.fr/hal-01857386
Contributor: Laurent D'Orazio
Submitted on: Wednesday, August 15, 2018 - 18:47:54
Last modified on: Friday, November 16, 2018 - 01:28:19
Document(s) archived on: Friday, November 16, 2018 - 12:52:22

File

18fuzzieee.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01857386, version 1

Citation

Thi-To-Quyen Tran, Thuong-Cang Phan, Anne Laurent, Laurent D’orazio. Improving Hamming distance-based fuzzy join in MapReduce using Bloom Filters. FUZZ-IEEE: International Conference on Fuzzy Systems, Jul 2018, Rio de Janeiro, Brazil. IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2018. 〈hal-01857386〉
