Exploring Validity Indices for Clustering Textual Data
Abstract
The goal of any clustering algorithm that produces flat partitions of data is to find both the optimal clustering solution and the optimal number of clusters. One natural way to reach this goal without requiring parameters is to embed a validity index in the clustering process, which can lead to an objective selection of the optimal number of clusters. In this chapter, we make two main contributions. First, since validity indices have mostly been studied on two- or three-dimensional datasets, we evaluate them on real-world applications, namely document clustering and word clustering. Second, we propose a new context-aware method that aims to improve the use of validity indices as stopping criteria in agglomerative algorithms. Experimental results show that the method is a step forward in using validity indices more reliably as stopping criteria.