Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution

Khadidja Meguelati 1 Bénédicte Fontez 2 Nadine Hilgert 2 Florent Masseglia 1
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Clustering with accurate results have become a topic of high interest. Dirichlet Process Mixture (DPM) is a model used for clustering with the advantage of discovering the number of clusters automatically and offering nice properties like, e.g., its potential convergence to the actual clusters in the data. These advantages come at the price of prohibitive response times, which impairs its adoption and makes centralized DPM approaches inefficient. We propose DC-DPM, a parallel clustering solution that gracefully scales to millions of data points while remaining DPM compliant, which is the challenge of distributing this process. Our experiments, on both synthetic and real world data, illustrate the high performance of our approach on millions of data points. The centralized algorithm does not scale and has its limit on 100K data points, where it needs more than 7 hours. In this case, our approach needs less than 30 seconds.
Liste complète des métadonnées

Contributor : Florent Masseglia <>
Submitted on : Wednesday, January 30, 2019 - 10:09:48 AM
Last modification on : Friday, March 1, 2019 - 4:07:08 PM


Files produced by the author(s)



Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia. Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution. SAC: Symposium on Applied Computing, Apr 2019, Limassol, Cyprus. ⟨10.1145/3297280.3297327⟩. ⟨hal-01999453⟩



Record views


Files downloads