Learning to hash: forgiving hash functions and applications Learning to hash: forgiving hash functions and applications
Venue
Data Mining and Knowledge Discovery (2008)
Publication Year
2008
Authors
Shumeet Baluja, Michele Covell
BibTeX
Abstract
The problem of efficiently finding similar items in a large corpus of
high-dimensional data points arises in many real-world tasks, such as music, image,
and video retrieval. Beyond the scaling difficulties that arise with lookups in
large data sets, the complexity in these domains is exacerbated by an imprecise
definition of similarity. In this paper, we describe a method to learn a similarity
function from only weakly labeled positive examples. Once learned, this similarity
function is used as the basis of a hash function to severely constrain the number
of points considered for each lookup. Tested on a large real-world audio dataset,
only a tiny fraction of the points (~0.27%) are ever considered for each lookup. To
increase efficiency, no comparisons in the original high-dimensional space of
points are required. The performance far surpasses, in terms of both efficiency and
accuracy, a state-of-the-art Locality-Sensitive-Hashing-based (LSH) technique for
the same problem and data set.
