Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality

ABSTRACT
The nearest neighbor problem is the following: Given a set P of n points in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q in X. This fundamental geometric problem is of growing importance as an abstraction of similarity search in areas such as information retrieval and image processing. We will focus on the particularly interesting case of the d-dimensional Euclidean space. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem where we are only required to produce a point p' whose distance from the query q is within a factor (1+e) of the distance between q and its true nearest neighbor. We will present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a truly sublinear query time (for e>1); and, (b) query time polynomial in d and log n, and only a mildly exponential preprocessing cost O(n/e^d). Further, applying a classical geometric lemma on random projections (for which we give a simpler proof), we obtain the first known algorithm with polynomial preprocessing and query time polynomial in d and log n. Unfortunately, for small e, the latter is a purely theoretical result since the exponent depends on 1/e. On the other hand, experimental results indicate that our first algorithm offers extremely fast running times on real data sets. Its key ingredient is the notion of locality-sensitive hashing which may be of independent interest; in fact, we will give applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering algorithms. Time permitting, we will also touch upon a new abstraction of similarity search that leads to an interesting generalization of the nearest neighbors problem.

(Joint work with Piotr Indyk.)