Scalable K-Means by ranked retrieval
Venue
Proceedings of the 7th ACM international conference on Web search and data mining, ACM, New York, NY, USA (2014), pp. 233-242
Publication Year
2014
Authors
Andrei Broder, Lluis Garcia-Pueyo, Vanja Josifovski, Sergei Vassilvitskii, Srihari Venkatesan
BibTeX
Abstract
The k-means clustering algorithm has a long history and a proven practical
performance, however it does not scale to clustering millions of data points into
thousands of clusters in high dimensional spaces. The main computational bottleneck
is the need to recompute the nearest centroid for every data point at every
iteration, aprohibitive cost when the number of clusters is large. In this paper we
show how to reduce the cost of the k-means algorithm by large factors by adapting
ranked retrieval techniques. Using a combination of heuristics, on two real life
data sets the wall clock time per iteration is reduced from 445 minutes to less
than 4, and from 705 minutes to 1.4, while the clustering quality remains within
0.5% of the k-means quality. The key insight is to invert the process of
point-to-centroid assignment by creating an inverted index over all the points and
then using the current centroids as queries to this index to decide on cluster
membership. In other words, rather than each iteration consisting of "points
picking centroids", each iteration now consists of "centroids picking points". This
is much more efficient, but comes at the cost of leaving some points unassigned to
any centroid. We show experimentally that the number of such points is low and thus
they can be separately assigned once the final centroids are decided. To speed up
the computation we sparsify the centroids by pruning low weight features. Finally,
to further reduce the running time and the number of unassigned points, we propose
a variant of the WAND algorithm that uses the results of the intermediate results
of nearest neighbor computations to improve performance.
