Learning to Search Efficiently in High Dimensions
Venue
Neural Information Processing Systems (2011)
Publication Year
2011
Authors
Zhen Li, Huazhong Ning, Liangliang Cao, Tong Zhan, Yihong Gong, Thomas S. Huang
BibTeX
Abstract
High dimensional similarity search in large scale databases becomes an important
challenge due to the advent of Internet. For such applications, specialized data
structures are required to achieve computational efficiency. Traditional approaches
relied on algorithmic constructions that are often data independent (such as
Locality Sensitive Hashing) or weakly dependent (such as kd-trees, k-means trees).
While supervised learning algorithms have been applied to related problems, those
proposed in the literature mainly focused on learning hash codes optimized for
compact embedding of the data rather than search efficiency. Consequently such an
embedding has to be used with linear scan or another search algorithm. Hence
learning to hash does not directly address the search efficiency issue. This paper
considers a new framework that applies supervised learning to directly optimize a
data structure that supports efficient large scale search. Our approach takes both
search quality and computational cost into consideration. Specifically, we learn a
boosted search forest that is optimized using pair-wise similarity labeled
examples. The output of this search forest can be efficiently converted into an
inverted indexing data structure, which can leverage modern text search
infrastructure to achieve both scalability and efficiency. Experimental results show
that our approach significantly outperforms the start-of-the-art learning to hash
methods (such as spectral hashing), as well as state-of-the-art high dimensional
search algorithms (such as LSH and k-means trees).
