MapFIM+: Memory Aware Parallelized Frequent Itemset Mining In Very Large Datasets

Abstract : Mining frequent itemsets in large datasets with the MapReduce programming model has received much attention in recent years. For instance, many efficient Frequent Itemset Mining (FIM) algorithms have been parallelized under the MapReduce model, such as Parallel Apriori, Parallel FP-Growth and Dist-Eclat. However, most approaches focus on job partitioning and/or load balancing without considering extensibility with respect to memory requirements. A challenge in designing parallel FIM algorithms is therefore to guarantee that the data structures used during the mining process fit in the local memory of the processing nodes at every computation step. In this paper, we propose MapFIM+, a two-phase approach to frequent itemset mining in very large datasets that benefits from both a MapReduce-based distributed Apriori method and local in-memory FIM methods. In our approach, MapReduce is first used to generate frequent itemsets until the prefix-projected databases fit in local memory; an optimized local in-memory mining process is then launched to generate all remaining frequent itemsets from each prefix-projected database on individual processing nodes. MapFIM+ improves our previous algorithm, MapFIM, by using an exact evaluation of prefix-projected database sizes during the MapReduce phase. This improvement makes MapFIM+ more efficient, especially for databases leading to huge candidate sets, by significantly reducing communication and disk I/O costs. Performance evaluation shows that MapFIM+ is more efficient and more extensible than existing MapReduce-based frequent itemset mining approaches.
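
The two-phase idea described in the abstract can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the function names (`project`, `mine_local`, `two_phase_mine`), the `memory_budget` parameter (measured here simply as a number of item occurrences) and the plain prefix-extension step standing in for the distributed MapReduce Apriori phase are all assumptions made for illustration. A prefix keeps being extended level by level while its prefix-projected database exceeds the budget; once the projected database fits, all remaining frequent itemsets for that prefix are mined in memory.

```python
from collections import Counter


def project(db, prefix):
    """Prefix-projected database: for each transaction containing `prefix`,
    keep only the items ordered after the last item of the prefix."""
    last = prefix[-1] if prefix else None
    projected = []
    for t in db:
        if all(item in t for item in prefix):
            rest = tuple(i for i in t if last is None or i > last)
            if rest:
                projected.append(rest)
    return projected


def mine_local(prefix, projected, min_support):
    """In-memory depth-first mining of every frequent extension of `prefix`."""
    results = {}
    counts = Counter(item for t in projected for item in t)
    for item in sorted(counts):
        if counts[item] >= min_support:
            extended = prefix + (item,)
            results[extended] = counts[item]
            sub = [tuple(i for i in t if i > item) for t in projected if item in t]
            results.update(mine_local(extended, [t for t in sub if t], min_support))
    return results


def two_phase_mine(transactions, min_support, memory_budget):
    """Hypothetical driver: level-wise extension (stand-in for the MapReduce
    phase) until a projected database fits `memory_budget`, then local mining."""
    db = [tuple(sorted(set(t))) for t in transactions]
    frequent = {}
    level = [()]  # prefixes still handled by the simulated distributed phase
    while level:
        next_level = []
        for prefix in level:
            projected = project(db, prefix)
            size = sum(len(t) for t in projected)  # exact projected size
            if size <= memory_budget:
                # Phase 2: the projected database fits, mine it locally.
                frequent.update(mine_local(prefix, projected, min_support))
            else:
                # Phase 1: count one-item extensions over the whole projection.
                counts = Counter(i for t in projected for i in t)
                for item, count in counts.items():
                    if count >= min_support:
                        extended = prefix + (item,)
                        frequent[extended] = count
                        next_level.append(extended)
        level = next_level
    return frequent


if __name__ == "__main__":
    data = [("a", "b", "c"), ("a", "b"), ("a", "c"), ("b", "c"), ("a", "b", "c")]
    for itemset, support in sorted(two_phase_mine(data, 3, 4).items()):
        print(itemset, support)
```

In this sketch the set of frequent itemsets does not depend on `memory_budget`; the budget only decides how much of the search is done level by level before the local phase takes over.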
Document type :
Journal articles

https://hal.archives-ouvertes.fr/hal-01934662
Contributor : Mostafa Bamha
Submitted on : Monday, November 26, 2018 - 10:26:22 AM
Last modification on : Thursday, February 7, 2019 - 4:56:54 PM

Identifiers

  • HAL Id : hal-01934662, version 1

Citation

Khanh-Chuong Duong, Mostafa Bamha, Arnaud Giacometti, D. Li Haoyuan, Arnaud Soulet, et al. MapFIM+: Memory Aware Parallelized Frequent Itemset Mining In Very Large Datasets. Transactions on Large-Scale Data- and Knowledge-Centered Systems, Springer Berlin / Heidelberg, 2018, Transactions on Large-Scale Data- and Knowledge-Centered Systems: Special Issue on Database- and Expert-Systems Applications, 39, ⟨https://doi.org/10.1007/978-3-662-58415-6_7⟩. ⟨hal-01934662⟩
