 # Efficient Algorithms for K-Means Clustering

Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu
 Software

 Technical Overview:
 This is a collection of C++ procedures for performing k-means clustering based on a combination of local search and Lloyd's algorithm (also known as the k-means algorithm). Given any set of k centers Z, for each center z in Z, let V(z) denote its neighborhood, that is, the set of data points for which z is the nearest neighbor. Each stage of Lloyd's algorithm moves every center point z to the centroid of V(z) and then updates V(z) by recomputing the distance from each point to its nearest center. These steps are repeated until convergence. However, Lloyd's algorithm can get stuck in locally minimal solutions that are far from the optimal. For this reason it is common to consider heuristics based on local search, in which centers are swapped in and out of an existing solution (typically at random). Such a swap is accepted if it decreases the average distortion, and otherwise it is ignored. It is also possible to combine these two approaches (Lloyd's algorithm and local search), producing a type of hybrid solution. This program provides a number of different algorithms for doing k-means clustering based on these ideas and combinations. Further information can be found in the software documentation and the above research papers. There are four different procedures for performing k-means, which have been implemented here. The main issue is how the neighbors are computed for each center. Lloyd's: Repeatedly applies Lloyd's algorithm with randomly sampled starting points. Swap: A local search heuristic, which works by performing swaps between existing centers and a set of candidate centers. EZ_Hybrid: A simple hybrid algorithm, which does one swap followed by some number of iterations of Lloyd's. Hybrid: A more complex hybrid of Lloyd's and Swap, which performs some number of swaps followed by some number of iterations of Lloyd's algorithm. To avoid getting trapped in local minima, an approach similar to simulated annealing is included as well.
 Software:
 The current version of the software, Version 1.7.2, is provided in a bundle (see below), which contains all the copyright notice, source code, documentation, and a test program. (This version is functionally identical to version 1.7.) Unfortunately, the documentation is not very extensive at this time. A Makefile has been provided, which runs on Unix for Sun Solaris and Redhat Linux, using the g++ compiler, and I have compiled it under Microsoft Visual Studio .NET. The directory KMLwin32/bin contains a version compiled for Microsoft Windows. This program is distributed under conditions of the GNU General Public License. (View copyright.) Download Currrent Version: (1.0Meg, Release Date: Jan 27, 2010) WinZip: kmlocal-1.7.2.zip Tar + zip/gzip: kmlocal-1.7.2.tar.gz
 Related Publications

 Paper:
 An efficient k-means clustering algorithm: Analysis and implementation, T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu, IEEE Trans. Pattern Analysis and Machine Intelligence, 24 (2002), 881-892.
 Abstract:
 In k-means clustering we are given a set of n data points in d-dimensional space and an integer k, and the problem is to determine a set of k points in d-space, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.

 Paper:
 A Local Search Approximation Algorithm for k-Means Clustering, T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu, Computational Geometry: Theory and Applications, 28 (2004), 89-112.)
 Abstract:
 In k-means clustering we are given a set of n data points in d-dimensional space and an integer k, and the problem is to determine a set of k points to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9+ε)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9-ε) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd's algorithm, this heuristic performs quite well in practice.