Parallelization of the k-means Algorithm in a Spectral Clustering Chain on CPU-GPU Platforms

k-means is a standard algorithm for clustering data. It constitutes generally the final step in a more complex chain of high quality spectral clustering. However this chain suffers from lack of scalability when addressing large datasets. This can be overcome by applying also the k-means algorithm as a pre-processing task to reduce the input data instances. We describe parallel optimization techniques for the k-means algorithm on CPU and GPU. Experimental results on synthetic dataset illustrate the numerical accuracy and performance of our implementations.


Introduction
Clustering refers to the process that aims at revealing the intrinsic structure of data by automatically grouping data instances into meaningful subsets called clusters. The intra-cluster similarity is supposed to be high while the inter-cluster similarity should be low. It is one of the most important tasks in machine learning and data mining and has numerous applications, such as image segmentation [16], video segmentation [17], document analysis [9], etc.
The k-means algorithm [13] is one of the most widely used clustering methods. It is a distance-based method that can efficiently find convex clusters, but it usually fails to discover non-convex clusters. It also relies on an appropriate selection of initial centroids to avoid being stuck in local minima solutions.
Spectral clustering [14] has gained popularity in the last two decades. Based on graph theory, it embeds data into the eigenspace of graph Laplacian and then performs k-means clustering on the embedding representation. Compared to classical k-means, spectral clustering has many advantages. First, it is able to discover non-convex clusters. Then, it has no problem of initialization and can lead to a global solution. Furthermore, one can exploit the unique "eigengap heuristic" [12] to estimate the number of clusters if the clusters are distinctly separated. Finally, spectral clustering algorithms have the potential to be efficiently implemented on HPC platforms since they require substantial linear algebra computations that can be processed using existing HPC libraries. However, spectral clustering has in general a computational cost of O (N 3 ) where N is the number of data instances [20]. This can be a critical issue when dealing with large-scale applications where N can be of order 10 6 or even larger. To overcome this difficulty, some researchers reduce the computational complexity of spectral clustering through methodological changes, e.g., power iteration clustering [11]. It is also possible to use approximation or summarization techniques so that only a small subset of data is involved in the complex computation, e.g., sparsification [4], Nyström approximation [6], or representatives extraction (using a preliminary k-means step) [20]. Moreover, another powerful way is to accelerate spectral clustering on parallel and distributed architectures, where using CPU-GPU heterogeneous platforms is particularly attractive because it combines the strengths of both processors. Specifically, CPUs are efficient in performing traditional computation tasks and have much more memory space than GPUs, while GPUs provide high performance in mathematically intensive computations.
We are interested in proposing a general CPU-GPU-based implementation of spectral clustering that can address large problems. There are several related studies. Zheng et al. [22] present a parallelization of spectral clustering and implement it on CPU and on GPU separately, but the performance for calculating the affinity matrix remains to be improved and the situation is not considered when a matrix is too large to be loaded into the device memory. Sundaram and Keutzer [17] apply spectral clustering for long term video segmentation on a cluster of GPUs. However, their implementation is dedicated to video segmentation and there is no measurement of speedup. Jin and JaJa [8] present a hybrid implementation of spectral clustering on CPU-GPU platforms for problems with a large number of clusters, but the considered datasets are of medium size and the eigensolver performance appears limited.
In this paper, we consider the parallelization of the processing chain of largescale spectral clustering by combining the use of representatives extraction with hybrid CPU-GPU computing. The main contributions of this paper are optimized implementations on CPU and GPU for the k-means algorithm, which are two steps of the global processing chain of spectral clustering. To our knowledge, most of the existing works related to parallel k-means algorithm on CPU (e.g., [3,10]) and on GPU (e.g., [3,5]) do not consider the issue of numerical accuracy that may occur in the update phase due to the propagation of round-off errors and that can lead to poor clustering quality. In this paper we address both high performance of the algorithm and numerical accuracy in the update phase of k-means clustering.
The remainder of this paper is organized as follows. Section 2 describes the computational chain for spectral clustering. In Sect. 3 we present our parallel implementations of k-means algorithm on CPU and GPU with the related optimizations. The experimental evaluation of our code is then presented in Sect. 4 and we conclude in Sect. 5.

A Computational Chain for Spectral Clustering
Spectral clustering has many slightly different algorithms. Here, we present the main steps of spectral clustering according to [12,14]. Given a set of N data instances of Dim dimensions: x 1 , ..., x N in R Dim that are supposed to be grouped into k c clusters, the three main steps of spectral clustering are the following (see also the right part of Fig. 1): 1. Construct the similarity matrix S. The similarity graph, which can be represented by an N × N similarity matrix S, is used to model the similarity between data instances. ε-neighborhood, k-nearest neighbors, and full connection are three common ways to construct the similarity graph [12]. The first two ways yield typically sparse similarity matrix while the last one generates dense matrix. The degree matrix D is a diagonal matrix that can be easily derived from S with d i = N j=1 s ij . The unnormalized graph Laplacian is defined as L = D − S and can be further normalized as the symmetrix matrix L sym = D −1/2 LD −1/2 [12]. Some other researchers define L sym = D −1/2 SD −1/2 that is normalized from S [14]. 2. Compute the first k c eigenvectors e 1 , ..., e kc of graph Laplacian L sym .
Here, by saying "the first k c eigenvectors", we refer to the eigenvectors corresponding to the k c smallest eigenvalues if graph Laplacian is normalized from L, or the k c largest eigenvalues if normalized from S. Let E denote the N × k c matrix containing the k c eigenvectors as columns, then form the matrix T by normalizing each row of E to 1. 3. Perform final k-means clustering. Each row of T can be considered as the embedding representation in R kc of the original data instance with the same row number. Therefore, performing k-means clustering on the rows of T allows to obtain the k c clusters of original data instances.
It can be seen that spectral clustering involves linear algebra computations, especially in the first two steps. This can be achieved using GPU computing and specifically some highly optimized CUDA libraries provided by NVIDIA, such as cuBLAS, cuSPARSE, cuSOLVER and nvGRAPH [15] or a public domain library like MAGMA [18]. If a matrix is sparse, e.g., the similarity matrix for ε-neighborhood graph or k-nearest neighbors graph, we can use the cuSPARSE library. The cuSOLVER library can be used for eigenvector computations in spectral clustering. Furthermore, the nvGRAPH, a library dedicated to graph analytics, contains an API for spectral clustering. However, the API has two important limits. First, it requires the number of clusters as an input in the configuration of spectral clustering (which, in practice, may be difficult to know in advance). Second, it assumes that the similarity matrix (in CSR format) is already prepared, which can be computationally expensive for a general problem.
In view of the limits identified previously in related studies and in NVIDIA solutions, we propose a strategy for parallelizing the complete spectral clustering chain on CPU-GPU heterogeneous architectures as shown in Fig. 1: The first step of the data flow illustrated in Fig. 1 allows to reduce significantly the volume of N input data, extracting k r representatives via k-means algorithm [20]. Typically, we have k c k r N . Each data instance is associated with the nearest representative. Then the k r representatives are transferred from host to device and spectral clustering is performed on GPU on these representatives to find the k c clusters, taking advantage of the CUDA libraries discussed earlier. In particular, it is possible to use either the cuSOLVER or the nvGRAPH Spectral Clustering API for the computation of eigenvectors. The latter also encapsulates the final k-means clustering step. The clustering result for the k r representatives is transferred from device to host, and finally we obtain the cluster labels of N data instances according to the attachment relationships in the first step.
Moreover, some heuristic methods for the automatic estimation of k c such as [12,19,21] can also be applied by using the eigenpairs calculated with or without the k r representatives approach.

Optimizing Parallel k-means Algorithm
In this section, we present the standard k-means algorithm and then describe our parallel and optimized implementations on CPU and GPU, including the inherent bottlenecks and our optimization methods especially for the step of updating centroids.

k-means Algorithm
The k-means algorithm is a distance-based iterative clustering method. Algorithm 1 describes the main steps. The inputs are supposed to be a dataset containing N instances in Dim dimensions, and the desired number of clusters K. The first step consists in selecting K initial centroids from the dataset, either randomly or in a heuristic way (see [1]). Then the algorithm repeats two routines, ComputeAssign and UpdateCentroids, until reaching the stopping criterion. The ComputeAssign routine computes the distance between each instance and each centroid, where the distances are measured using the Euclidean norm. For each instance, we compare the distances related to different centroids and assign the instance to the nearest centroid. In addition, we track the number of instances that have different assignments (i.e. cluster labels) over two consecutive iterations. The UpdateCentroids routine calculates the means of all instances that are assigned to the same centroid and updates the centroids. The stopping criterion can be either a maximal number of iterations, or a relatively stable result, i.e., when the proportion of data instances that change of label is lower that a predefined tolerance. The outputs are the cluster labels of all data instances.

Parallel Implementations
The parallelization of the k-means algorithm on CPU is achieved by using OpenMP and auto-vectorization and by minimizing cache misses. The GPU code is developed in CUDA. We minimize data transfers between CPU and GPU using pinned memory for fast transfers. Specifically, the data instances to be clustered are transferred from CPU to GPU at the beginning of program, then a series of CUDA kernels and library functions are launched from CPU to perform k-means clustering on GPU, finally the cluster labels are transferred to CPU. For the coalescence of memory access, we need to transfer the transposed matrix of data instances. We also transpose the matrix of centroids on GPU, but the overhead is insignificant since it is typically a small matrix. Moreover, in order to check the stopping criterion, at each iteration we need to transfer to CPU the number of instances that change of label, but the price of this transfer is negligible. Besides, we set the optimal sizes for grids and blocks of threads. The CPU code can be used for the preliminary step that extracts representatives while the GPU code can serve as the third step of the spectral clustering algorithm (see Fig. 1). In both codes, we minimize data storage and access by integrating distances computation and instances assignment into one routine (ComputeAssign).
This ComputeAssign routine exhibits a natural parallelism, leading to a straightforward parallel implementation, both on CPU and GPU, not detailed in this paper. Conversely, the UpdateCentroids routine appears more difficult to be efficiently parallelized and is a source of rounding errors due to reduction operations.
Effect of Rounding Errors. For implementations both on CPU and GPU, when using large datasets and floating-point numbers with single precision (32bits arithmetic), we encountered the problem caused by rounding errors that derive from the finite representation capacity of floating-point numbers in particular when adding two numbers of very different magnitudes. In the Update-Centroids routine, the algorithm needs to calculate the sum of data instances in each cluster and then divide the sum by the number of instances in the cluster. Therefore, when a large number of instances are added together one by one naively, the accumulation of rounding errors that may occur finally deteriorates the clustering quality (see [7] for an illustration of the effect of rounding errors). On the other hand, using double precision (64-bits arithmetic) can reduce the effect of rounding errors to a satisfying level of accuracy in our use case, but the computational cost is higher (see e.g., [2]). To preserve the performance of computing in single precision while minimizing the effect of rounding errors, we developed a two-step method as follows.

Two-
Step Method for UpdateCentroids Routine. We split data instances into a certain number of packages of similar size, then calculate the sum within each package (first step), and compute the sum of all packages (second step). By choosing an appropriate number of packages, we can avoid adding numbers of significantly different magnitudes and obtain satisfactory numerical accuracy. We illustrate hereafter how to efficiently parallelize this method on CPU and GPU.  Suppose that we divide N data instances into P packages and perform reductions in two steps, the CPU implementation code is displayed in Listing 1.1. We use both private and reduction clauses in OpenMP directive on line 5, to parallelize the outer loops of the 2 reduction steps, while inner loops are compliant with the main requirements of auto-vectorization (accessing contiguous array indexes and avoiding divergences) engaged with -O3 compilation flag.
For parallel implementation on GPU, we exploit shared memory, dynamic parallelism and multiple streams to achieve better performance. The Update-Centroids routine is split into two steps: UpdateCent S1 computing the sum of instance values within each package (step 1) and UpdateCent S2 computing the values of new centroids (step 2). As shown in Listing 1.2, by using dynamic parallelism (CUDA threads creating child threads), the host code is simplified to two parent kernel launches. Each parent grid is small and contains only nb of streams threads (one thread per stream). Each thread in UpdateCent S1 Parent kernel processes several packages on its own stream (created on line 42), and launches one child grid per package of data instances (lines 47-57). Each child grid contains nb of instances per package × nb of dimensions per instance working threads, and child grids launched on different streams run concurrently as long as there are sufficient hardware resources in the GPU. This strategy allows to optimize the GPU usage independently of the number and size of packages. Thus, the number of packages is constrained only by the rounding error problem. The cudaStreamDestroy (line 58) ensures that this stream will not be reused to launch other threads, while the parent thread will only end when all of its child threads are finished.
In UpdateCent S1 Child kernel, by using shared memory, the expensive atomicAdd operations are performed by every block instead of every thread, hence are reduced significantly (Listing 1.3, lines 31 and 33). Specifically, threads in the same block calculate the local sum by block size at first, then all the local sums are added together through a few atomicAdd operations.  A similar strategy is used to implement step 2 of our complete solution on GPU. Each thread of the parent grid processes several packages, and creates child grids on its own stream. Each child grid is in charge to update the K × Dim centroid values with the contribution of its package. So, it contains K × Dim threads, each one executing only few operations and one atomicAdd (shared memory is not adapted to and not used in step 2 computations). Again, using dynamic parallelism and multiple streams has allowed to speedup the execution.

Experimental Evaluation
The experiments have been carried out on a server located at CentraleSupelec (Metz campus). This server has two 10-core Intel(R) Xeon(R) Silver 4114 processors at 2.2 GHz, and a NVIDIA GeForce RTX 2080 Ti containing 4352 CUDA cores. The CPU code is compiled with gcc version 7.4.0 (with -O3 flag) to have parallelization with OpenMP, vectorization on AVX units and various optimizations. The GPU code is compiled with CUDA version 10.2. Moreover, to use dynamic parallelism in CUDA (see Sect. 3.2) we need to adopt the separate compilation mode: generating and embedding relocatable device code into the host object, before calling the device linker.
As benchmark, we use a synthetic 4D dataset created in Python. It contains 50 million instances uniformly distributed in 4 convex clusters (12.5 million instances in each cluster). Each cluster has a radius of 9 and the centroids are supposed to be (40, 40, 60, 60), (40, 60, 60, 40), (60, 40, 40, 60) and (60, 60, 40, 40), respectively, in the way that the k-means algorithm would not be sensitive to the initialization of centroids and would not be trapped in local minimum solutions. However, due to the intrinsic errors of generating pseudorandom numbers and the rounding errors of floating-point numbers, it appears the calculated centroids could have a deviation of order 10 −4 from the ideal ones.
In our benchmark, we iterate the algorithm while any data instance is attached to a new centroid (tolerance = 0, see Sect. 3.1). Since the number of iterations on CPU and GPU can vary depending on independent selections of initial centroids and on the numerical precision, we are more interested here in the elapsed time per iteration than the overall execution time. The most important results in our tables are highlighted in boldface.
In Table 1, we evaluate the k-means clustering on CPU by comparing the average numerical error of final centroids and the elapsed time per iteration by varying the number of threads, the arithmetic precision, and the number of packages. The column "Loop" represents the whole of two k-means routines. We observe that using a certain number of packages in the UpdataCentroids routine reduces the numerical error in single precision. In our case, using 100 packages is enough for achieving the same numerical accuracy as in double precision. Using single precision instead of double precision decreases the elapsed time.
We give in Table 2 the accuracy and performance results of k-means clustering on GPU. Using packages reduces the effect of rounding errors, and this reduction is enhanced by using the shared memory that allows initial local reductions. The routine UpdateCentroids is the most time-consuming routine on GPU while ComputeAssign represents a small proportion of the runtime. In our GPU implementation, we optimize the configuration of grids and blocks of threads. Table 3 shows an example of how block configuration affects the performance where we set BLOCK SIZE Y (BSYD in listings) to 4 (the number of dimensions of the synthetic data). Note that the centroids initialization and most of data transfers are performed only one time, hence their impact on the whole runtime decreases with the number of iterations. The elapsed time for regular transpositions of some small data appears negligible ( Table 2).   Table 4 demonstrates the impact of GPU optimization on the running time of UpdateCentroids. Compared to the naïve implementation with many atomi-cAdd operations, using shared memory reduces significantly the execution time for different number of packages. The dynamic parallelism also improves the performance in the case of 100 packages and 1000 packages but it degrades the performance for 10000 packages. This is because the GPU hardware resources are not fully concurrently exploited when there are a large number of small packages to be processed on the default stream. Therefore, introducing multiple streams could contribute to the concurrent use of hardware resources and consequently reduce the execution time, which is clearly demonstrated in the case of 10000 packages. The combined use of dynamic parallelism, shared memory and streams achieves very good performances for a general number of packages.
The speedups for the two routines of k-means and the resulting full iteration are displayed in Table 5. For the k-means loop, the best speedup obtained (compared with the sequential implementation) is about ×10 on CPU using 40 logical cores and almost ×50 on GPU (which is 5 times faster than on CPU using 40 logical cores). For the ComputeAssign routine we achieve much higher speedups (around ×300) on GPU than on CPU while the speedups for the UpdateCentroids routine are similar on CPU and GPU.

Conclusion and Future Work
We have proposed parallel implementations on CPU and GPU for the k-means algorithm, which is a key component of the computational chain for spectral clustering on CPU-GPU heterogeneous platforms. We have addressed via a two-step reduction the numerical accuracy issue that may occur in the phase of updating centroids due to the effect of rounding errors. Our GPU implementation employs dynamic parallelism, shared memory and streams to achieve optimal performance for updating centroids. Experiments on a synthetic dataset demonstrate both numerical accuracy and parallelization efficiency of our implementations.
In this paper we have used only a synthetic dataset but as future work we plan to evaluate our parallel k-means algorithms on real-world datasets and compare our implementation with other existing ones. In particular we expect to obtain higher speedups in high-dimensional datasets or those containing a large number of clusters, where the phase of computing the distances is more significant.