Publication Data
Learning to Search Efficiently in High Dimensions
Abstract: High dimensional similarity search in large scale databases
becomes an important challenge due to the advent of Internet. For such applications,
specialized data structures are required to achieve computational efficiency.
Traditional approaches relied on algorithmic constructions that are often data
independent (such as Locality Sensitive Hashing) or weakly dependent (such as kd-trees,
k-means trees). While supervised learning algorithms have been applied to related
problems, those proposed in the literature mainly focused on learning hash codes
optimized for compact embedding of the data rather than search efficiency. Consequently
such an embedding has to be used with linear scan or another search algorithm. Hence
learning to hash does not directly address the search efficiency issue. This paper
considers a new framework that applies supervised learning to directly optimize a data
structure that supports efficient large scale search. Our approach takes both search
quality and computational cost into consideration. Specifically, we learn a boosted
search forest that is optimized using pair-wise similarity labeled examples. The output
of this search forest can be efficiently converted into an inverted indexing data
structure, which can leverage modern text search infrastructure to achieve both
scalability and efficiency. Experimental results show that our approach significantly
outperforms the start-of-the-art learning to hash methods (such as spectral hashing),
as well as state-of-the-art high dimensional search algorithms (such as LSH and k-means
trees).
