Scalable k-NN based text clustering

Alessandro Lulli; Thibault Debatty; Matteo Dell'Amico; Pietro Michiardi; Laura Ricci

doi:10.1109/BigData.2015.7363845

Conference paper, Year: 2015


Alessandro Lulli

Role: Author

University of Pisa - Università di Pisa

Thibault Debatty

Role: Author

Royal Military Academy (RMA)

Eurecom [Sophia Antipolis]

Matteo Dell'Amico

Role: Author

Symantec Research Labs

Pietro Michiardi

Role: Author
PersonId : 1084771

Eurecom [Sophia Antipolis]

Laura Ricci

Role: Author

California Institute of Technology

Abstract

Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns, as well as identifying common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present our approach for text clustering that builds an approximate k-NN graph, which is then used to compute connected components representing clusters. Our focus is to understand the scalability / accuracy tradeoff that underlies our method: we do so through an extensive experimental campaign, where we use real-life datasets, and show that even rough approximations of k-NN graphs are sufficient to identify valid clusters. Our method is scalable and can be easily tuned to meet requirements stemming from different application domains.
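The two-stage idea in the abstract (build a k-NN graph over documents, then read clusters off its connected components) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses an exact brute-force k-NN graph as a stand-in for the approximate one, and the Jaccard similarity on word sets, the function names, and the `min_similarity` cutoff are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_graph(docs, k, min_similarity=0.0):
    """Brute-force k-NN graph: map each doc index to its k most similar others.

    Assumption: an exact O(n^2) search stands in for an approximate k-NN graph.
    Edges whose similarity does not exceed min_similarity (hypothetical) are dropped.
    """
    tokens = [set(d.lower().split()) for d in docs]
    graph = {}
    for i, ti in enumerate(tokens):
        scored = [(jaccard(ti, tj), j) for j, tj in enumerate(tokens) if j != i]
        # Highest similarity first; ties broken by lower index for determinism.
        scored.sort(key=lambda s: (-s[0], s[1]))
        graph[i] = [j for sim, j in scored[:k] if sim > min_similarity]
    return graph


def connected_components(graph):
    """Treat k-NN edges as undirected and return clusters as sorted index lists."""
    parent = {i: i for i in graph}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, neighbours in graph.items():
        for j in neighbours:
            parent[find(i)] = find(j)  # union the two components

    clusters = {}
    for i in graph:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())


docs = [
    "cheap pills buy now",
    "buy cheap pills now online",
    "football match tonight",
    "tonight football match result",
    "quarterly earnings report",
]
clusters = connected_components(knn_graph(docs, k=1, min_similarity=0.2))
print(clusters)
```

In this toy setting, raising `k` or lowering `min_similarity` adds edges and tends to merge clusters, which is one way to picture the kind of tuning the abstract refers to.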

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]; Data Structures and Algorithms [cs.DS]

Main file

rs-publi-4743.pdf (348.66 KB)

Origin: Files produced by the author(s)


https://hal.science/hal-01525701

Submitted on: Monday, May 29, 2017, 21:54:59

Last modified on: Wednesday, January 4, 2023, 15:22:08

Long-term archiving on: Wednesday, September 6, 2017, 10:01:36

Dates and versions

hal-01525701, version 1 (29-05-2017)

Identifiers

HAL Id: hal-01525701, version 1
DOI: 10.1109/BigData.2015.7363845

Cite

Alessandro Lulli, Thibault Debatty, Matteo Dell'Amico, Pietro Michiardi, Laura Ricci. Scalable k-NN based text clustering. 2015 IEEE International Conference on Big Data, Oct 2015, Santa Clara, United States. pp. 958-963, ⟨10.1109/BigData.2015.7363845⟩. ⟨hal-01525701⟩


Collections

EURECOM

