Conference papers

Scalable k-NN based text clustering

Abstract: Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns and identification of common topics in social media. Because such datasets are very large, algorithmic scalability becomes a major concern. In this work, we present an approach to text clustering that builds an approximate k-NN graph and then computes its connected components, each of which represents a cluster. Our focus is on understanding the scalability/accuracy trade-off that underlies the method: an extensive experimental campaign on real-life datasets shows that even rough approximations of the k-NN graph are sufficient to identify valid clusters. The method is scalable and can easily be tuned to meet the requirements of different application domains.
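The two steps named in the abstract (build a k-NN graph, then take connected components as clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: it builds an exact k-NN graph by brute force rather than an approximate one, and the Jaccard similarity on whitespace tokens, the `min_sim` edge threshold, and all function names are assumptions chosen for illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_graph(docs, k, min_sim=0.0):
    """Exact brute-force k-NN graph: node -> list of (neighbor, similarity).

    The paper builds an *approximate* graph for scalability; brute force is
    used here only to keep the sketch short. `min_sim` is a hypothetical
    threshold that drops edges to dissimilar neighbors.
    """
    tokens = [set(d.lower().split()) for d in docs]
    graph = {}
    for i, ti in enumerate(tokens):
        scored = [(jaccard(ti, tj), j) for j, tj in enumerate(tokens) if j != i]
        # Highest similarity first; ties broken by node index for determinism.
        scored.sort(key=lambda s: (-s[0], s[1]))
        graph[i] = [(j, s) for s, j in scored[:k] if s > min_sim]
    return graph


def connected_components(graph):
    """Treat edges as undirected and group nodes with union-find."""
    parent = {n: n for n in graph}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, neighbors in graph.items():
        for j, _ in neighbors:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for n in graph:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


docs = [
    "cheap pills buy now",
    "buy cheap pills now online",
    "meeting agenda for monday",
    "monday meeting agenda draft",
    "holiday photos",
]
clusters = connected_components(knn_graph(docs, k=2, min_sim=0.3))
```

In this toy run the two spam-like texts and the two meeting-related texts each form a cluster, and the unrelated text stays alone. The brute-force graph construction is quadratic in the number of documents, which is the cost an approximate k-NN graph is meant to avoid.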

Contributor: Thibault Debatty
Submitted on: Monday, May 29, 2017 - 9:54:59 PM
Last modification on: Friday, July 26, 2019 - 11:56:02 AM
Long-term archiving on: Wednesday, September 6, 2017 - 10:01:36 AM






Alessandro Lulli, Thibault Debatty, Matteo Dell'Amico, Pietro Michiardi, Laura Ricci. Scalable k-NN based text clustering. 2015 IEEE International Conference on Big Data, Oct 2015, Santa Clara, United States. pp. 958-963, ⟨10.1109/BigData.2015.7363845⟩. ⟨hal-01525701⟩


