Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution

Khadidja Meguelati 1, Bénédicte Fontez 2, Nadine Hilgert 2, Florent Masseglia 1
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Clustering with accurate results has become a topic of high interest. The Dirichlet Process Mixture (DPM) is a clustering model that discovers the number of clusters automatically and offers useful properties, such as potential convergence to the actual clusters in the data. These advantages come at the price of prohibitive response times, which impair its adoption and make centralized DPM approaches inefficient. We propose DC-DPM, a parallel clustering solution that gracefully scales to millions of data points while remaining DPM compliant, which is the main challenge in distributing this process. Our experiments, on both synthetic and real-world data, illustrate the high performance of our approach on millions of data points. The centralized algorithm does not scale: it reaches its limit at 100K data points, where it needs more than 7 hours. In this case, our approach needs less than 30 seconds.
https://hal.archives-ouvertes.fr/hal-01999453
Contributor : Florent Masseglia
Submitted on : Wednesday, January 30, 2019 - 10:09:48 AM
Last modification on : Friday, March 1, 2019 - 4:07:08 PM

File

ACM_SigConf_SAC2019.pdf
Files produced by the author(s)

Citation

Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia. Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution. SAC: Symposium on Applied Computing, Apr 2019, Limassol, Cyprus. ⟨10.1145/3297280.3297327⟩. ⟨hal-01999453⟩
