Nearest Neighbor Search in Google Correlate
Venue
Google (2013)
Publication Year
2013
Authors
Dan Vanderkam, Rob Schonberger, Henry Rowley, Sanjiv Kumar
BibTeX
Abstract
This paper presents the algorithms which power Google Correlate, a tool which finds
web search terms whose popularity over time best matches a user-provided time
series. Correlate was developed to generalize the query-based modeling techniques
pioneered by Google Flu Trends and make them available to end users. Correlate
searches across millions of candidate query time series to find the best matches,
returning results in less than 200 milliseconds. Its feature set and requirements
present unique challenges for Approximate Nearest Neighbor (ANN) search techniques.
In this paper, we present Asymmetric Hashing (AH), the technique used by Correlate,
and show how it can be adapted to the specific needs of the product. We then
develop experiments to test the throughput and recall of Asymmetric Hashing as
compared to a brute-force search. For "full" search vectors, we achieve a 10x
speedup over brute force search while maintaining 97% recall. For search vectors
which contain holdout periods, we achieve a 4x speedup over brute force search,
also with 97% recall.
